Take-Home Midterm
SOLUTIONS
Exercise 1
Exercise 1.1
There are several acceptable ways to answer this part, assuming that
the results can be taken as "typical":
If you understood "How many bugs per trap?", the answer would be
168.66667, that is, the number of bugs (5060) divided by the number of
traps (30). The command would be sum(x$count)/30. A confidence
interval for this answer would also be useful (137 to 199); several
methods can be used to obtain it.
If you understood "How many bugs per day?", the best estimate would be
5060, the total count. The command would be sum(x$count).
A graphical display could also be accepted, such as a boxplot
or a histogram of the totals, or some other option, as long as it
gave clear information about the question.
The number of bugs per half trap, 84.333, wouldn't be as good
a description, since the number is different for the upper and
lower halves of the trap.
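As a quick check of the arithmetic above, here is a minimal sketch in Python (not S-PLUS). It only uses the totals quoted in this solution; the variable names are made up for the illustration.

```python
# Totals quoted in the solution text.
total_bugs = 5060   # sum of all counts, i.e. sum(x$count)
n_traps = 30        # number of traps

# Bugs per trap and bugs per half trap (each trap has an upper and a lower half).
per_trap = total_bugs / n_traps
per_half_trap = total_bugs / (2 * n_traps)

print(round(per_trap, 5), round(per_half_trap, 3))
```

The per-half-trap figure is just the per-trap figure divided by two, which is why it hides the difference between the upper and lower halves.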
Exercise 1.2
Suitable graphical displays include a boxplot with the two
portions of the traps split, a histogram by blocks, or a
scatterplot with different symbols depending on the portion. There are
several possible answers for this question.
Exercises 1.3 and 1.4
Two ways to address this problem were accepted: either, as
suggested, through the ratio between the two portions, or through the
difference between the portions.
If you computed the ratios, the best answer would be that, on
average, there tend to be about 2.5 times as many bugs in the lower
portion as in the upper portion. You could get this result by
creating a new vector with the ratios between the two portions for
each trap and then taking the average. With this vector you can run a
one-sample t-test, a bootstrap, etc., and get the confidence
interval. If you ran a one-sample t-test, you would get a confidence
interval of
Lower Bound | Upper Bound |
2.061 | 2.949 |
Of course, bootstrapping or other methods could give you slightly
different intervals.
A logarithmic transformation of the data could also be used to
solve this problem, followed by a paired t-test on the
difference of logarithms.
Another option would have been to run the paired t-test without
any transformation (lower portion - upper portion), but notice that the
key here is not to lose the information that the data come in
pairs. All possible solutions must take into account that the data come
in pairs and should be treated in pairs. A two-sample t-test is,
therefore, not appropriate for this problem.
One more option you could have used is the ANOVA. We have two
variables here that affect (or, at this stage, that can affect) the
number of bugs trapped: trap and height (the portion of the trap). You
could have converted both variables to factors and run a
two-factor AOV, which would basically tell you the same thing: that both
variables are significant. You could do this in the Commands window
of S-PLUS with aov(x$count~as.factor(x$trap)+as.factor(x$height)) or
by going to Statistics-Analysis of Variance-Fixed Effects and creating
the formula in the window. Don't forget that you need to treat the
variables as factors. If they're numeric, the computer won't warn you
that the analysis is wrong (in this case, for
example, trap is numeric). In Windows you can change the variables to
factors in Data-Change data type. In the Commands window, use
"as.factor".
Bootstrapping is the last possibility for creating the confidence
interval.
If you used the differences instead of the ratios, you should have
gotten your C.I.s around a mean of 65.9333. Again, the method you use
to determine the C.I. will affect the exact bounds, but it has to be
consistent with the characteristics of the data.
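The paired treatment described above can be sketched as follows. This is a minimal Python illustration, not the S-PLUS commands used in class: the six (upper, lower) pairs are hypothetical toy numbers, NOT the midterm data, and bootstrap_ci is a made-up helper that computes a simple percentile bootstrap interval for a mean.

```python
import random
import statistics

# Hypothetical (upper, lower) counts for six traps. NOT the midterm data.
pairs = [(20, 55), (35, 80), (50, 110), (15, 48), (42, 90), (28, 75)]

# Keep the pairing: one ratio and one difference per trap.
ratios = [lower / upper for upper, lower in pairs]
diffs = [lower - upper for upper, lower in pairs]

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample whole traps (with replacement) and record each resample mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    alpha = (1 - level) / 2
    lo = means[int(alpha * n_boot)]
    hi = means[int((1 - alpha) * n_boot) - 1]
    return lo, hi

ratio_ci = bootstrap_ci(ratios)
diff_ci = bootstrap_ci(diffs)
print("mean ratio:", statistics.fmean(ratios), "CI:", ratio_ci)
print("mean difference:", statistics.fmean(diffs), "CI:", diff_ci)
```

Because each resample draws whole traps, the upper and lower counts of a trap always stay together, which is the point a two-sample analysis would miss.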
Exercise 1.5
We know basically nothing about how this study was run: the way the traps
were placed, the times, the weather conditions, and many other confounding
variables could affect these results. We can observe some differences,
but extrapolation is too risky and has no good basis. Further
study is necessary, along with more information about the variables that
could affect the final outcome.
Exercise 2
This problem has to be addressed through an analysis of variance. You
could notice several features in the data, so this solution covers
only the basic, major questions of
interest. Further study of other effects is also valid.
Exercise 2.1
In this part of the exercise you have to state with some
clarity what you want to study. There are several possibilities,
since several elements could be analyzed in this dataset, but there are
two basic questions that you should address:
Is there an island effect on the size of the leaves?
Is there a colony effect on the size of the leaves?
Other analyses are welcome, but these are required.
Exercise 2.2
Is there any difference between colonies in terms of the size of the
leaves they carry?
There are two ways to solve this:
Analyze the colony effect through a one-way
AOV.
Analyze both the colony effect and the island effect together
in a two-way AOV.
If you did the one-way ANOVA, you would have these results:
| d.f. | S.S. | M.S. | F-value | Pr(F) |
colony | 7 | 3.333777 | 0.4762538 | 2.445959 | 0.0369 |
Residuals | 36 | 7.009576 | 0.1947104 |
If you decided to do the two-way AOV, you would get something
like:
| d.f. | S.S. | M.S. | F-value | Pr(F) |
colony | 6 | 2.838212 | 0.4730353 | 2.429429 | 0.045 |
island | 1 | 0.495565 | 0.495565 | 2.545138 | 0.119 |
Residuals | 36 | 7.009576 | 0.1947104 |
Again, don't forget that you have to treat the colony as a factor, or
you would do the wrong analysis. A good way to check is to look at
the degrees of freedom and see whether they're what they're supposed to
be (number of levels - 1) for each factor you're treating.
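The degrees-of-freedom check and the F-value in the one-way table can be reproduced from the sums of squares alone. The following is a small Python sketch using only numbers from that table; the count of 8 colonies is an assumption inferred from the 7 d.f., and the variable names are made up.

```python
# Values copied from the one-way AOV table for colony.
ss_colony, df_colony = 3.333777, 7
ss_resid, df_resid = 7.009576, 36

# A factor with k levels uses k - 1 degrees of freedom.
n_colonies = 8  # assumption: inferred from the 7 d.f. in the table
expected_df = n_colonies - 1

# Mean squares are S.S. / d.f.; the F-value is the ratio of mean squares.
ms_colony = ss_colony / df_colony
ms_resid = ss_resid / df_resid
f_colony = ms_colony / ms_resid

print(expected_df == df_colony, round(ms_resid, 7), round(f_colony, 4))
```

If colony had been left numeric, it would have used a single degree of freedom instead of 7, which is the quickest symptom to look for.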
Is there any island effect?
To answer whether there is an island effect or not, you could
have used the previous AOV table.
If you decided to keep your analysis at the one-way level, you
should have done a one-way AOV to check whether there is an island effect
or not. A graphical display alone would not be accepted. This question
required a more thorough analysis.
The one-way AOV results would be the following:
| d.f. | S.S. | M.S. | F-value | Pr(F) |
island | 1 | 0.4955650 | 0.4955650 | 2.113544 | 0.1534331 |
Residuals | 42 | 9.847787 | 0.2344711 |
We can see that there is no significant island effect, but there is a
colony effect.
To produce all the previous tables, you could either use the aov command
that I included in the answers for exercise 1 or, again, the window
Statistics-AOV-Fixed Effects with the appropriate formula.
Exercise 2.3
The scope of this study is very limited.
We can be fairly certain that there are differences between
these particular colonies, but anything more general really
cannot be inferred. Particularly interesting questions that
remain unanswered are: Is the difference between colonies
observed here typical of other colonies in the region?
How much of a role does the local environment play in harvesting
behavior? Do new colonies match the behavior of their parent
colonies?
You could have included some plots, such as boxplots to see differences
between islands or colonies, or 3-D plots to see differences in
the concentration of the data. If you didn't, no points were deducted.
Exercise 3
Exercise 3.1
There is no question about what you had to do for this part: an AOV.
The command for the unix-workers would be
aov(x$yield~as.factor(x$block)+as.factor(x$variety)). For the
windows-lovers it would be Statistics-AOV-FE:
Formula=yield~block+variety.
You should get this:
| d.f. | S.S. | M.S. | F-value | Pr(F) |
block | 3 | 21.16625 | 7.055417 | 6.241430 | 0.003371618 |
variety | 7 | 29.16375 | 4.166250 | 3.685588 | 0.009411924 |
Residuals | 21 | 23.73875 | 1.130417 |
Given these results, you have to be consistent for the rest of the
problem. If, as we can see, there is a variety effect even after
adjusting for the block effect, you're expected to consider the variety
effect throughout the analysis unless you show that, at a certain
point, it doesn't exist. Therefore, further analyses that do not
consider this effect are not accepted.
Exercise 3.2
These confidence intervals can easily be constructed by following these
steps:
Statistics-AOV-Fixed Effects. Formula=yield~block+variety. Then go
to the Results subwindow and click on "Adjusted means." This way
you'll get the adjusted means with their respective standard
errors. Then you can compute the confidence intervals for the variety
means after taking into account the block effect.
You'll get the following adjusted means and standard errors:
| G | V | RI | F | P | E | M | RE |
Mean | 6.55 | 6.3 | 3.5 | 5.625 | 6.5 | 6.55 | 5.55 | 5.575 |
S.E. | 0.5316 | 0.5316 | 0.5316 | 0.5316 | 0.5316 | 0.5316 | 0.5316 | 0.5316 |
And the intervals would be
| G | V | RI | F | P | E | M | RE |
Lower Bound | 5.4 | 5.2 | 2.4 | 4.5 | 5.4 | 5.4 | 4.4 | 4.5 |
Upper Bound | 7.7 | 7.4 | 4.6 | 6.7 | 7.6 | 7.7 | 6.7 | 6.7 |
To construct these intervals you should have used a t distribution
with 21 degrees of freedom, whose 0.975 quantile is 2.079614. To get
this exact value you could have used the qt(0.975,21) function, or
approximated it with a t with 20 d.f. Both possibilities were accepted.
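The interval computation is just mean ± t × S.E. Here is a minimal Python sketch that reproduces the bounds from the adjusted means, using the t value quoted above. The check of the standard error assumes each variety mean averages 4 plots (one per block), which is inferred from the degrees of freedom rather than stated in the problem.

```python
import math

# Adjusted means and common standard error, copied from the table above.
means = {"G": 6.55, "V": 6.3, "RI": 3.5, "F": 5.625,
         "P": 6.5, "E": 6.55, "M": 5.55, "RE": 5.575}
se = 0.5316
t_crit = 2.079614  # qt(0.975, 21), as quoted in the text

# 95% interval for each variety: mean +/- t * S.E., rounded to one decimal.
half_width = t_crit * se
intervals = {v: (round(m - half_width, 1), round(m + half_width, 1))
             for v, m in means.items()}

# Assumption: 4 plots per variety, so S.E. = sqrt(residual M.S. / 4).
se_from_ms = math.sqrt(1.130417 / 4)

for variety, (lo, hi) in intervals.items():
    print(variety, lo, hi)
print(round(se_from_ms, 4))
```

All eight intervals have the same width because the design is balanced, so only the centers differ.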
Exercise 3.3
This question is straightforward to answer through the Windows menus. If you go
to Statistics-Multiple Comparisons and select the correct
variables/tests, for the Tukey method you'll get something like this
(plus some nice plots):
| Estimate | Std. Error | L. Bound | U. Bound |
E-F | 9.25e-001 | 0.752 | -1.600 | 3.450 |
E-G | 6.36e-015 | 0.752 | -2.520 | 2.520 |
E-M | 1.00e+000 | 0.752 | -1.520 | 3.520 |
E-P | 5.00e-002 | 0.752 | -2.470 | 2.570 |
E-RE | 9.75e-001 | 0.752 | -1.550 | 3.500 |
E-RI | 3.05e+000 | 0.752 | 0.528 | 5.570 |
E-V | 2.50e-001 | 0.752 | -2.270 | 2.770 |
F-G | -9.25e-001 | 0.752 | -3.450 | 1.600 |
F-M | 7.50e-002 | 0.752 | -2.450 | 2.600 |
F-P | -8.75e-001 | 0.752 | -3.400 | 1.650 |
F-RE | 5.00e-002 | 0.752 | -2.470 | 2.570 |
F-RI | 2.12e+000 | 0.752 | -0.397 | 4.650 |
F-V | -6.75e-001 | 0.752 | -3.200 | 1.850 |
G-M | 1.00e+000 | 0.752 | -1.520 | 3.520 |
G-P | 5.00e-002 | 0.752 | -2.470 | 2.570 |
G-RE | 9.75e-001 | 0.752 | -1.550 | 3.500 |
G-RI | 3.05e+000 | 0.752 | 0.528 | 5.570 |
G-V | 2.50e-001 | 0.752 | -2.270 | 2.770 |
M-P | -9.50e-001 | 0.752 | -3.470 | 1.570 |
M-RE | -2.50e-002 | 0.752 | -2.550 | 2.500 |
M-RI | 2.05e+000 | 0.752 | -0.472 | 4.570 |
M-V | -7.50e-001 | 0.752 | -3.270 | 1.770 |
P-RE | 9.25e-001 | 0.752 | -1.600 | 3.450 |
P-RI | 3.00e+000 | 0.752 | 0.478 | 5.520 |
P-V | 2.00e-001 | 0.752 | -2.320 | 2.720 |
RE-RI | 2.08e+000 | 0.752 | -0.447 | 4.600 |
RE-V | -7.25e-001 | 0.752 | -3.250 | 1.800 |
RI-V | -2.80e+000 | 0.752 | -5.320 | -0.278 |
What can we see in this table? There are 4
confidence intervals that don't include 0. This means that the
difference between the varieties compared is clear (at 95%
confidence). Notice that all 4 comparisons involve variety RI, showing
that this variety has a clearly lower productivity.
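Reading the table amounts to checking which intervals exclude zero. A small Python sketch, restricted to the seven comparisons that involve RI (bounds copied from the Tukey table), could look like this. It also checks that the standard error of a difference is sqrt(2) times the standard error of a single adjusted mean, which holds here because all means share the same S.E.

```python
import math

# (lower, upper) Tukey bounds for the comparisons involving RI.
ri_intervals = {
    "E-RI": (0.528, 5.570),
    "F-RI": (-0.397, 4.650),
    "G-RI": (0.528, 5.570),
    "M-RI": (-0.472, 4.570),
    "P-RI": (0.478, 5.520),
    "RE-RI": (-0.447, 4.600),
    "RI-V": (-5.320, -0.278),
}

# An interval that lies entirely above or below 0 signals a clear difference.
significant = [pair for pair, (lo, hi) in ri_intervals.items()
               if lo > 0 or hi < 0]

# S.E. of a difference of two means that each have S.E. 0.5316.
se_diff = math.sqrt(2) * 0.5316

print(significant)
print(round(se_diff, 3))
```

Note that F, M and RE are not separated from RI by this criterion, even though their estimated differences are about 2.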
Exercise 3.4
At this point, given the hint in question 3.4, we know that the plots
beside the hedge are affecting the production.
What can we do about this problem? How can we isolate and analyze the
hedge effect? There are several ways to do this, and I'll introduce
some of them. If you wrote something different but consistent, it may
be right. These are just some of the possible answers.
Possibility 1
You could analyze the hedge effect through a three-way AOV. You
could do this by assigning, in a new vector, a value X to the cells away
from the hedge and a value Y to the cells beside the hedge.
Then you could treat all variables (variety, block and hedge) as
factors and fit your three-way ANOVA. In unix this would be
aov(x$yield~x$variety+x$block+x$hedge). You can enter the formula
in the AOV window and get the same results.
Two possible orders for introducing the variables are given here; each
has a different meaning in terms of the order in which the
effects are adjusted:
1A) yield~block+hedge+variety
1B) yield~hedge+block+variety
A correct explanation of the results given the model selected is a
plus.
1A | d.f. | S.S. | M.S. | F-value | Pr(F) |
block | 3 | 21.16625 | 7.05542 | 12.15305 | 0.00009477 |
hedge | 1 | 26.70083 | 26.70083 | 45.99255 | 0.00000135 |
variety | 7 | 14.59073 | 2.08439 | 3.59039 | 0.01149494 |
Residuals | 20 | 11.61094 | 0.58055 |
1B | d.f. | S.S. | M.S. | F-value | Pr(F) |
hedge | 1 | 42.35161 | 42.35161 | 72.95123 | 0.00000004 |
block | 3 | 5.51548 | 1.83849 | 3.16683 | 0.04689609 |
variety | 7 | 14.59073 | 2.08439 | 3.59039 | 0.01149494 |
Residuals | 20 | 11.61094 | 0.58055 |
AOV 1B is especially interesting. We can see that, after adjusting for the
hedge effect, the block effect shrinks considerably (its p-value is
only just below 0.05). Is that enough to conclude that there isn't a real
block effect apart from the hedge effect? That is open to discussion,
but it's good to point out this circumstance.
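The two tables can be compared numerically to see what changing the order does and does not change. This Python sketch uses only the sums of squares from tables 1A and 1B and treats them as sequential sums of squares (each term adjusted for the terms listed before it), which is how the two orderings are described above.

```python
# Sums of squares copied from tables 1A and 1B.
fit_1a = {"block": 21.16625, "hedge": 26.70083,
          "variety": 14.59073, "Residuals": 11.61094}
fit_1b = {"hedge": 42.35161, "block": 5.51548,
          "variety": 14.59073, "Residuals": 11.61094}

# The total sum of squares does not depend on the order of the terms.
total_1a = sum(fit_1a.values())
total_1b = sum(fit_1b.values())

# Block and hedge together explain the same amount in both orders;
# only the split between the two changes.
joint_1a = fit_1a["block"] + fit_1a["hedge"]
joint_1b = fit_1b["block"] + fit_1b["hedge"]

# Share of the joint block+hedge S.S. credited to block in each order.
block_share_1a = fit_1a["block"] / joint_1a
block_share_1b = fit_1b["block"] / joint_1b

print(round(total_1a, 4), round(total_1b, 4))
print(round(block_share_1a, 2), round(block_share_1b, 2))
```

Variety is entered last in both models, so its line (and the residual line) is identical in the two tables.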
Possibility 2
You could also have considered a hedge effect by blocks. If you
accept that there is a block effect (that is, if you assume that this block
effect is independent of the hedge effect), the analysis should be
done by creating two more blocks and analyzing the block+variety
effect with these 6 new blocks.
This could be done by creating a new vector that assigns the hedge
elements of the second block to a fifth block and the hedge elements
of the fourth block to a sixth block. Then the analysis is
straightforward. This method lets you incorporate the hedge effect as
a part of the block effect.
You could get something like this:
yield~newblock+variety
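The relabeling described above can be sketched in a few lines of Python. The function name new_block and the list of plots are hypothetical, made up for the illustration; in particular, the plots listed are NOT the actual field layout.

```python
def new_block(block, beside_hedge):
    """Return the block label for the six-block analysis."""
    if beside_hedge and block == 2:
        return 5  # hedge plots of block 2 become block 5
    if beside_hedge and block == 4:
        return 6  # hedge plots of block 4 become block 6
    return block  # every other plot keeps its original block


# Hypothetical plots as (original block, beside the hedge?) pairs.
plots = [(1, False), (2, False), (2, True),
         (3, False), (4, True), (4, False)]
newblock = [new_block(block, hedge) for block, hedge in plots]
print(newblock)  # [1, 2, 5, 3, 6, 4]
```

The resulting vector would then be converted to a factor and used in place of the original block variable.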
2 | d.f. | S.S. | M.S. | F-value | Pr(F) |
variety | 7 | 29.16375 | 4.166250 | 7.31128 | 0.0002566519 |
newblock | 5 | 34.07806 | 6.815612 | 11.96060 | 0.0000247737 |
Residuals | 19 | 10.82694 | 0.569839 |
Possibility 3
You could drop the hedge data and continue with the
two-way analysis. If you want the variety effects and drop the data
beside the hedge, you just have to adjust for the block effect to get the
net variety effect (and the adjusted means to continue with the
analysis).
You would have to create a new vector with all the data except the
observations on the border. Make sure that your new number of observations is
28 and that your new factor variables are really factors.
In this case you would get something like this:
newyield~newbl+variety
3 | d.f. | S.S. | M.S. | F-value | Pr(F) |
newbl | 3 | 6.25381 | 2.084603 | 3.580587 | 0.03584085 |
variety | 7 | 12.57850 | 1.796929 | 3.086468 | 0.02724999 |
Residuals | 17 | 9.89733 | 0.582196 |
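A quick way to make sure the right number of observations went into each model is to check the residual degrees of freedom: number of observations, minus 1, minus the d.f. of every term. The Python sketch below does this for the tables of this exercise; residual_df is a made-up helper, and the 32 observations of the full dataset are inferred from the first table (3 + 7 + 21 + 1), not stated in the problem.

```python
def residual_df(n_obs, *term_dfs):
    """Residual d.f. = observations - 1 (for the mean) - d.f. of each term."""
    return n_obs - 1 - sum(term_dfs)


# Full data, assumed to be 32 plots: block (3) + variety (7).
print(residual_df(32, 3, 7))
# Possibility 1: block (3) + hedge (1) + variety (7).
print(residual_df(32, 3, 1, 7))
# Possibility 2: variety (7) + newblock (5).
print(residual_df(32, 7, 5))
# Possibility 3: hedge plots removed, 28 observations left.
print(residual_df(28, 3, 7))
```

If the residual d.f. in your output doesn't match this bookkeeping, either some rows were not removed or a variable was not treated as a factor.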
Given these results, the conclusion should be to go back to
question 3.2 and solve it again, this time taking the
hedge effect into account. This effect clearly exists (all the p-values
show it), so the solution should be to account for it and remove
it, in order to get the real adjusted variety effect, not biased by any
unaccounted-for variable.
Exercise 3.5
In order to be consistent with the analysis, you should have gone back
to question 3.2 and used the information you got in 3.4 to recompute
the confidence intervals and give the most accurate answer possible to
3.5. It doesn't make sense to acknowledge the existence of a problem and
then ignore it in the following answer. Therefore the recommendations for
the farmer should be based on your results for 3.4 and not on your
results for 3.2.
Variety P seems to be the best option for the farmer, and variety RI
seems to be the worst. So I'd recommend P. Of course, the farmer's
conditions aren't going to match those of the experiment exactly.
Recommending a variety is usually something that takes repeated
trials over a number of seasons and locations.
Of course, don't plant your strawberries next to the hedge!