
Take-Home Midterm

SOLUTIONS

Exercise 1

Exercise 1.1
There are several acceptable ways to answer this part, assuming the observed results can be treated as "typical":

If you read the question as "How many bugs per trap?", the answer would be 168.66667, that is, the total number of bugs (5060) divided by the number of traps (30). The command would be sum(x$count)/30. A confidence interval for this answer would also be useful (137 to 199), and several methods could be used to obtain it.

If you read the question as "How many bugs per day?", the best estimate would be 5060. The command would be sum(x$count).

A graphical display would also have been accepted, such as a boxplot or a histogram of the totals, or some other option, as long as it gave clear information about the question.

The number of bugs per half trap, 84.333, wouldn't be as good a description, since the count differs between the upper and lower halves of the trap.

Exercise 1.2

Suitable graphical displays include a boxplot with the two portions of the traps split, a histogram by blocks, or a scatterplot with different symbols depending on the portion. There are several possible answers to this question.

Exercises 1.3 and 1.4
Two ways of addressing this problem were accepted: either, as suggested, as a ratio between the two portions, or as a difference between the portions.

If you computed the ratios, the best answer would be that, on average, there tend to be about 2.5 times as many bugs in the lower portion as in the upper portion. You could get this result by creating a new vector with the ratio between the two portions for each trap and then taking the average. With this vector you can run a one-sample t-test, a bootstrap, or a similar procedure to get the confidence interval. If you ran a one-sample t-test, you would get a confidence interval of

Lower Bound	Upper Bound
2.061	2.949

Of course, bootstrapping or other methods could give you slightly different intervals.

A logarithmic transformation of the data could also be used to solve this problem, followed by a paired t-test on the difference of logarithms.

Another option would have been to run the paired t-test without any transformation (lower portion - upper portion). The key here is not to lose the information that the data come in pairs. Every acceptable solution must take the pairing into account and treat the data in pairs. A two-sample t-test is, therefore, not appropriate for this problem.
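To see why the pairing matters, here is a minimal Python sketch (not the S-PLUS commands used in the course). The six pairs of counts are invented for illustration and are not the exam data. It compares the standard error of the mean difference when the data are treated as pairs with the standard error from a two-sample analysis that ignores the pairing.

```python
import math
import statistics

# Hypothetical counts for six traps (NOT the exam data): lower and upper
# portions of the same trap, listed in the same trap order.
lower = [120, 95, 210, 60, 150, 180]
upper = [50, 40, 85, 22, 61, 70]
n = len(lower)

# Paired analysis: one difference per trap, then a one-sample standard error.
diffs = [lo - up for lo, up in zip(lower, upper)]
mean_diff = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n)

# Two-sample standard error (unequal variances), which ignores the pairing.
se_unpaired = math.sqrt(
    statistics.variance(lower) / n + statistics.variance(upper) / n
)

# Per-trap ratios, as used in the ratio-based answer.
ratios = [lo / up for lo, up in zip(lower, upper)]
mean_ratio = statistics.mean(ratios)

print(f"mean difference = {mean_diff:.2f}")
print(f"SE (paired)     = {se_paired:.2f}")
print(f"SE (two-sample) = {se_unpaired:.2f}")
print(f"mean ratio      = {mean_ratio:.2f}")
```

With data like these, where traps that catch many bugs in one portion also catch many in the other, the paired standard error is noticeably smaller, so the paired interval is tighter.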

One more option you could have used is the ANOVA. We have two variables here that affect (or, at this stage, that may affect) the number of bugs trapped: trap and height (the portion of the trap). You could have converted both variables to factors and run a two-factor AOV, which would tell you basically the same thing: both variables are significant. You could do this in the Commands window of S-PLUS with aov(x$count~as.factor(x$trap)+as.factor(x$height)), or by going to Statistics-Analysis of Variance-Fixed Effects and creating the formula in the window. Don't forget that you need to treat the variables as factors. If they're numeric, the computer won't warn you that the analysis is wrong (in this case, for example, trap is numeric). In the Windows version you can change the variables to factors in Data-Change data type. In the Commands window the function is as.factor.

Bootstrapping is one more way to construct the confidence interval.
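As a rough illustration of the bootstrap option, the following Python sketch builds a percentile bootstrap interval for the mean ratio. The ratios are invented for illustration and are not the exam data. The function name bootstrap_mean_ci, the number of resamples and the seed are all assumptions of this sketch.

```python
import random
import statistics

# Hypothetical lower/upper ratios, one per trap (NOT the exam data).
ratios = [2.4, 2.4, 2.5, 2.7, 2.5, 2.6, 1.9, 3.1, 2.2, 2.8]

def bootstrap_mean_ci(values, n_resamples=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample the traps with replacement and record each resample's mean.
    means = sorted(
        statistics.mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lo_index = round((alpha / 2) * n_resamples)
    hi_index = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_index], means[hi_index]

low, high = bootstrap_mean_ci(ratios)
print(f"mean ratio = {statistics.mean(ratios):.3f}")
print(f"95% percentile bootstrap interval: ({low:.3f}, {high:.3f})")
```

Each resample draws whole traps (that is, whole ratios), so the pairing between the lower and upper portions is preserved.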

If you worked with the differences instead of the ratios, your C.I. should be centered on a mean difference of 65.9333. Again, the method you use to determine the C.I. will change the result somewhat, but it has to be consistent with the characteristics of the data.

Exercise 1.5

We know basically nothing about how this study was run. The way the traps were placed, the times, the weather conditions and many other confounding variables could affect these results. We can observe some differences, but extrapolating from them is too risky and has no sound basis. Further study is necessary, along with more information about the variables that could affect the final outcome.

Exercise 2

This problem has to be addressed through an Analysis of Variance. You may have noticed several features in the data, so this solution covers only the basic, major questions of interest. Further study of other effects is also valid.

Exercise 2.1

In this part of the exercise you have to state clearly what you want to study. There are several possibilities, since several aspects of this dataset could be analyzed, but there are two basic questions that you should address:

Is there an island effect in the size of the leaves?

Is there a colony effect in the size of the leaves?

Other analyses are welcome, but these are required.

Exercise 2.2
Is there any difference between colonies in terms of the size of the leaves they carry? There are two ways to approach this:

To analyze the colony effect through a one-way AOV.

To analyze both the colony effect and the island effect at a time in a two-way AOV.

If you did the one-way anova, you would have these results:

	d.f.	S.S.	M.S.	F-value	Pr(F)
colony	7	3.333777	0.4762538	2.445959	0.0369
Residuals	36	7.009576	0.1947104

If you decided to do the two-way AOV, you would get something like:

	d.f.	S.S.	M.S.	F-value	Pr(F)
colony	6	2.838212	0.4730353	2.429429	0.045
island	1	0.495565	0.495565	2.545138	0.119
Residuals	36	7.009576	0.1947104
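A quick way to sanity-check an AOV table is to confirm the degrees of freedom and recompute the F-value from the sums of squares. The Python sketch below does this for the one-way colony table. The number of observations (44) and the number of colonies (8) are not stated in the text; they are inferred from the degrees of freedom in the tables and should be read as assumptions.

```python
# Values copied from the one-way colony table above.
ss_colony, df_colony = 3.333777, 7
ss_resid, df_resid = 7.009576, 36

# Inferred from the tables (assumptions, not stated in the text):
n_obs = 44       # total d.f. = 7 + 36 = 43 = n_obs - 1
n_colonies = 8   # a factor with k levels uses k - 1 degrees of freedom

# If colony had been left numeric, it would use 1 d.f. instead of 7.
assert df_colony == n_colonies - 1
assert df_resid == (n_obs - 1) - df_colony

# Mean squares are sums of squares divided by their degrees of freedom,
# and the F-value is the ratio of the two mean squares.
ms_colony = ss_colony / df_colony
ms_resid = ss_resid / df_resid
f_colony = ms_colony / ms_resid

print(f"M.S. colony    = {ms_colony:.7f}")
print(f"M.S. residuals = {ms_resid:.7f}")
print(f"F-value        = {f_colony:.6f}")
```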

Again, don't forget that you have to treat colony as a factor, or you will run the wrong analysis. A good way to check is to look at the degrees of freedom and confirm that they are what they should be (number of levels - 1) for each factor in the model.

Is there any island effect?

To answer whether there is an island effect or not, you could have used the previous AOV table.

If you decided to keep your analysis at the one-way level, you should have run a one-way AOV to check whether there is an island effect. A graphical display alone would not be accepted; this question required a more thorough analysis. The one-way AOV results would be the following:

	d.f.	S.S.	M.S.	F-value	Pr(F)
island	1	0.4955650	0.4955650	2.113544	0.1534331
Residuals	42	9.847787	0.2344711

We can see that there is no island effect, but there is a colony effect.

To produce all the previous tables, you could either use the aov command that I included in the answers for exercise 1 or, again, the window Stat-AOV-Fixed effects with the appropriate formula.

Exercise 2.3

The scope of this study is very limited. We can be fairly certain that there are differences between these particular colonies, but nothing more general can really be inferred. Particularly interesting questions that remain unanswered are: Is the difference between colonies observed here typical of other colonies in the region? How much of a role does the local environment play in harvesting behavior? Do new colonies match the behavior of their parent colonies?

You could have included some plots, such as boxplots to see differences between islands or colonies, or 3-D plots to see differences in the concentration of the data. If you didn't, no points were taken off.

Exercise 3

Exercise 3.1

There is no question about what you had to do for this part: an AOV. The command for the unix-workers would be aov(x$yield~as.factor(x$block)+as.factor(x$variety)). For the windows-lovers it would be Statistics-AOV-FE: Formula=yield~block+variety. You should get this:

	d.f.	S.S.	M.S.	F-value	Pr(F)
block	3	21.16625	7.055417	6.241430	0.003371618
variety	7	29.16375	4.166250	3.685588	0.009411924
Residuals	21	23.73875	1.130417

Given these results, you have to be consistent for the rest of the problem. If, as we can see, there is a variety effect even after adjusting for the block effect, you're expected to consider the variety effect throughout the analysis unless you show that, at a certain point, it doesn't exist. Therefore, further analyses that ignore this effect are not accepted.

Exercise 3.2

These confidence intervals can easily be constructed by following these steps:

Statistics-AOV-Fixed Effects. Formula=yield~block+variety. Then go to the Results subwindow and click on "Adjusted means." This gives you the adjusted means with their respective standard errors. Then you can compute the confidence intervals for the variety effects after taking the block effect into account. You'll get the following adjusted means and standard errors:

	G	V	RI	F	P	E	M	RE
Adj. mean	6.55	6.3	3.5	5.625	6.5	6.55	5.55	5.575
S.E.	0.5316	0.5316	0.5316	0.5316	0.5316	0.5316	0.5316	0.5316

And the intervals would be

	G	V	RI	F	P	E	M	RE
Lower Bound	5.4	5.2	2.4	4.5	5.4	5.4	4.4	4.5
Upper Bound	7.7	7.4	4.6	6.7	7.6	7.7	6.7	6.7

To construct these intervals you should have used a t distribution with 21 degrees of freedom: t(21)=2.079614. To get this exact value you could have used the qt(0.975,21) function, or you could have approximated it with a t with 20 d.f. Both possibilities were accepted.
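The arithmetic behind the intervals (adjusted mean plus or minus t times the standard error) can be sketched in Python as follows. The means, the standard error and the t value are copied from the text above. The cross-check of the standard error as sqrt(MSE/4) is an assumption of this sketch: it presumes a balanced design with one plot per variety in each of the 4 blocks.

```python
import math

# Adjusted means and common standard error, copied from the table above.
adjusted_means = {
    "G": 6.55, "V": 6.3, "RI": 3.5, "F": 5.625,
    "P": 6.5, "E": 6.55, "M": 5.55, "RE": 5.575,
}
std_error = 0.5316
t_crit = 2.079614  # qt(0.975, 21), as quoted in the text

# Cross-check (assumes a balanced design with 4 blocks): the S.E. of an
# adjusted mean is sqrt(residual mean square / number of blocks).
residual_ms = 1.130417
n_blocks = 4
se_check = math.sqrt(residual_ms / n_blocks)

# 95% confidence interval for each variety: mean +/- t * S.E.
margin = t_crit * std_error
intervals = {
    variety: (mean - margin, mean + margin)
    for variety, mean in adjusted_means.items()
}

print(f"S.E. recomputed from the AOV table: {se_check:.4f}")
for variety, (lower, upper) in intervals.items():
    print(f"{variety:>2}: {lower:.1f} to {upper:.1f}")
```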

Exercise 3.3

This question is straightforward to answer with the Windows menus. If you go to Statistics-Multiple Comparisons and select the correct variables and tests, for the Tukey method you'll get something like this (plus some nice plots):

	Estimate	Std. Error	L. Bound	U. Bound
E-F	9.25e-001	0.752	-1.600	3.450
E-G	6.36e-015	0.752	-2.520	2.520
E-M	1.00e+000	0.752	-1.520	3.520
E-P	5.00e-002	0.752	-2.470	2.570
E-RE	9.75e-001	0.752	-1.550	3.500
E-RI	3.05e+000	0.752	0.528	5.570
E-V	2.50e-001	0.752	-2.270	2.770
F-G	-9.25e-001	0.752	-3.450	1.600
F-M	7.50e-002	0.752	-2.450	2.600
F-P	-8.75e-001	0.752	-3.400	1.650
F-RE	5.00e-002	0.752	-2.470	2.570
F-RI	2.12e+000	0.752	-0.397	4.650
F-V	-6.75e-001	0.752	-3.200	1.850
G-M	1.00e+000	0.752	-1.520	3.520
G-P	5.00e-002	0.752	-2.470	2.570
G-RE	9.75e-001	0.752	-1.550	3.500
G-RI	3.05e+000	0.752	0.528	5.570
G-V	2.50e-001	0.752	-2.270	2.770
M-P	-9.50e-001	0.752	-3.470	1.570
M-RE	-2.50e-002	0.752	-2.550	2.500
M-RI	2.05e+000	0.752	-0.472	4.570
M-V	-7.50e-001	0.752	-3.270	1.770
P-RE	9.25e-001	0.752	-1.600	3.450
P-RI	3.00e+000	0.752	0.478	5.520
P-V	2.00e-001	0.752	-2.320	2.720
RE-RI	2.08e+000	0.752	-0.447	4.600
RE-V	-7.25e-001	0.752	-3.250	1.800
RI-V	-2.80e+000	0.752	-5.320	-0.278

What can we see in this table? There are 4 confidence intervals that don't include 0. This means that, at 95% confidence, there is a clear difference between the varieties being compared. Note that all 4 of these comparisons involve variety RI, showing that this variety has a clearly lower productivity.
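The same reading of the table can be sketched in Python by rebuilding every pairwise difference from the adjusted means in 3.2 and flagging the intervals that exclude 0. The half-width of 2.522 is not computed here; it is read off the table (for example, the E-RI estimate 3.05 minus its lower bound 0.528), so this is a check of the table's logic rather than a Tukey computation.

```python
from itertools import combinations

# Adjusted means copied from exercise 3.2.
adjusted_means = {
    "G": 6.55, "V": 6.3, "RI": 3.5, "F": 5.625,
    "P": 6.5, "E": 6.55, "M": 5.55, "RE": 5.575,
}

# Half-width of each simultaneous interval, read off the table above
# (an assumption of this sketch, not a computed Tukey critical value).
half_width = 2.522

significant = []
for a, b in combinations(sorted(adjusted_means), 2):
    diff = adjusted_means[a] - adjusted_means[b]
    lower, upper = diff - half_width, diff + half_width
    # An interval that excludes 0 indicates a clear difference.
    if lower > 0 or upper < 0:
        significant.append(f"{a}-{b}")

print(significant)
```

Under these assumptions the flagged pairs are E-RI, G-RI, P-RI and RI-V, the same four comparisons identified in the table.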

Exercise 3.4

Given the hint in question 3.4, at this point we know that one column of plots (the one next to the hedge) is affecting the production. What can we do about this? How can we isolate and analyze the hedge effect? There are several ways to do this, and I'll introduce some of them. If you wrote something different but consistent, it may still be right. These are just some of the possible answers.

Possibility 1

You could analyze the hedge effect through a three-way AOV. You could do this by creating a new vector that assigns a value X to the cells away from the hedge and a value Y to the cells beside the hedge. Then you could treat all variables (variety, block and hedge) as factors and run your 3-way ANOVA. In unix this would be aov(x$yield~x$variety+x$block+x$hedge). You can enter the same formula in the AOV window and get the same results.

There are two possible orders in which to enter the variables, and they differ in which effects each term is adjusted for:
1A) yield~block+hedge+variety
1B) yield~hedge+block+variety

A correct explanation of the results given the model selected is a plus.

1A	d.f.	S.S.	M.S.	F-value	Pr(F)
block	3	21.16625	7.05542	12.15305	0.00009477
hedge	1	26.70083	26.70083	45.99255	0.00000135
variety	7	14.59073	2.08439	3.59039	0.01149494
Residuals	20	11.61094	0.58055

1B	d.f.	S.S.	M.S.	F-value	Pr(F)
hedge	1	42.35161	42.35161	72.95123	0.00000004
block	3	5.51548	1.83849	3.16683	0.04689609
variety	7	14.59073	2.08439	3.59039	0.01149494
Residuals	20	11.61094	0.58055
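Comparing tables 1A and 1B shows what the order of the terms does and does not change. The Python sketch below uses only the sums of squares printed in the two tables. The tables look like sequential sums of squares, where each term is adjusted only for the terms entered before it; that reading is an assumption based on the numbers, since the text does not say which type of sums of squares the software reports.

```python
# Sums of squares copied from tables 1A and 1B above.
table_1a = {"block": 21.16625, "hedge": 26.70083,
            "variety": 14.59073, "Residuals": 11.61094}
table_1b = {"hedge": 42.35161, "block": 5.51548,
            "variety": 14.59073, "Residuals": 11.61094}

# Block and hedge together explain the same amount in both orders;
# only the split between the two terms changes.
combined_1a = table_1a["block"] + table_1a["hedge"]
combined_1b = table_1b["block"] + table_1b["hedge"]

# The total sum of squares is the same in both tables as well.
total_1a = sum(table_1a.values())
total_1b = sum(table_1b.values())

print(f"block + hedge, order 1A: {combined_1a:.5f}")
print(f"block + hedge, order 1B: {combined_1b:.5f}")
print(f"total S.S.: {total_1a:.5f} vs {total_1b:.5f}")
print(f"block S.S. before hedge: {table_1a['block']:.5f}")
print(f"block S.S. after hedge:  {table_1b['block']:.5f}")
```

Variety is entered last in both orders, so its line is identical in the two tables.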

AOV 1B is especially interesting. We can see that, after adjusting for the hedge effect, the block effect shrinks considerably. Is that enough to conclude that there is no real block effect apart from the hedge effect? That is open to discussion, but it is worth pointing out.

Possibility 2

You could also have treated the hedge effect through the blocks. If you accept that there is a block effect (and assume that this block effect is independent of the hedge effect), the analysis can be done by creating two more blocks and analyzing the block+variety effect with these 6 new blocks. This could be done by creating a new vector that assigns the hedge elements of the second block to a fifth block and the hedge elements of the fourth block to a sixth block. Then the analysis is straightforward. This method lets you incorporate the hedge effect as part of the block effect. You would get something like this: yield~newblock+variety

2	d.f.	S.S.	M.S.	F-value	Pr(F)
variety	7	29.16375	4.166250	7.31128	0.0002566519
newblock	5	34.07806	6.815612	11.96060	0.0000247737
Residuals	19	10.82694	0.569839
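The recoding into six blocks can be sketched in Python as below. The plot records are invented for illustration and are not the exam data, and the function name new_block is hypothetical. Only the rule itself (hedge plots of block 2 go to block 5, hedge plots of block 4 go to block 6) comes from the text.

```python
def new_block(block, next_to_hedge):
    """Return the block label after moving the hedge plots to new blocks."""
    if next_to_hedge and block == 2:
        return 5  # hedge plots of the second block become a fifth block
    if next_to_hedge and block == 4:
        return 6  # hedge plots of the fourth block become a sixth block
    return block  # every other plot keeps its original block

# Hypothetical plot records (NOT the exam data): (block, next_to_hedge).
plots = [(1, False), (2, False), (2, True), (3, False), (4, True), (4, False)]

new_blocks = [new_block(block, hedge) for block, hedge in plots]
print(new_blocks)
```

The resulting vector would then be converted to a factor before being used as newblock in the formula.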

Possibility 3

You could drop the hedge data and continue with the two-way analysis. Once the hedge plots are removed, you only have to adjust for the block effect to get the net variety effect (and the adjusted means needed to continue with the analysis). You would have to create a new vector with all the data except the plots on the border. Make sure that your new number of observations is 28 and that your new factor variables really are factors. In this case you would get something like this: newyield~newbl+variety

3	d.f.	S.S.	M.S.	F-value	Pr(F)
newbl	3	6.25381	2.084603	3.580587	0.03584085
variety	7	12.57850	1.796929	3.086468	0.02724999
Residuals	17	9.89733	0.582196

Given these results, the conclusion should be to go back to question 3.2 and solve it again, this time taking the hedge effect into account. This effect clearly exists (all the p-values show it), so the solution is to account for it and remove it, in order to get the real adjusted variety effect, one that is not biased by an unaccounted-for variable.

Exercise 3.5

To be consistent with the analysis, you should have gone back to question 3.2 and used the information you got in 3.4 to recompute the confidence intervals and give the most accurate answer possible to 3.5. It doesn't make sense to acknowledge the existence of a problem and then ignore it in the following answer. Therefore, the recommendations for the farmer should be based on your results for 3.4 and not on your results for 3.2.

Variety P seems to be the best option for the farmer, and variety RI seems to be the worst. So I'd recommend P. Of course, the farmer's conditions aren't going to match those of the experiment exactly. Recommending a variety is usually something that takes repeated trials over a number of seasons and locations.

Of course, don't plant your strawberries next to the hedge!