STA242/ENV255 - KEY TO MIDTERM EXAM - SPRING 1999

The average grade was 81.7398/120, and the standard deviation was 11.4775. And to appease a certain troublemaker, the third quartile was 90 and the first quartile was 75. The high score was 109.

1) If one were on a witch hunt, one might find some negative skewness or increasing residual variance with increasing values of the predicted log range. The clumping to the left and the right is irrelevant. In any event, with only 39 data points there is not enough evidence to say definitively that there is skewness or that there is heteroskedasticity. Large residuals happen, and it would be interesting to look further at the point with the large negative residual; otherwise I would proceed without too much regret.
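For those without JMP in front of them, here is a minimal sketch of how such a residual plot could be drawn in Python. The file name oaks.csv and the column names geo_range, tree_height, acorn_size, and region are made up for illustration (range is avoided as a column name because it is a Python builtin):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # Fit Model 1: log(range) on tree_height, log(acorn_size), and region.
    oaks = pd.read_csv("oaks.csv")
    fit = smf.ols("np.log(geo_range) ~ tree_height + np.log(acorn_size) + region",
                  data=oaks).fit()

    # Residuals vs. predicted log range: look for skewness and changing spread.
    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("predicted log(range)")
    plt.ylabel("residual")
    plt.show()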

2) The null hypothesis is that all the slopes are zero. The alternative is that they are not all zero. Note that the alternative is not that the true slopes are all nonzero. To test the null hypothesis we look at the whole-model F-test. The test statistic is 23.2926, which is statistically significant at any level of significance larger than its p-value, which is below 0.0001. There is sufficient evidence to reject the null hypothesis at any reasonable level of significance. The data suggest that at least one of the true slopes is not zero. The individual t-tests are not irrelevant here, but they do not test the hypothesis in question.
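As a quick arithmetic check, the whole-model F statistic can be recovered from the R-square value given in answer 3, with n = 39 observations and k = 3 predictors: F = (R-square/k)/((1 - R-square)/(n - k - 1)) = (0.666279/3)/(0.333721/35) = 23.29, which matches the 23.2926 above up to rounding.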

3) The R-square value for Model 1 is 0.666279. This is the proportion of variation in log(range) that is explained by the regression on the three predictors. The adjusted R-square value is a goodness-of-fit index; it is not the proportion of explained variance.
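For comparison, the adjusted R-square applies a penalty for the number of predictors: adjusted R-square = 1 - (1 - R-square)(n - 1)/(n - k - 1) = 1 - (0.333721)(38/35), which comes to about 0.6377 here.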

4) In Model 1 all the variance inflation factors are less than 1.2. A very slight degree of multicollinearity is present, but it certainly isn't a problem here. In Model 2 a couple of predictors have VIFs of 5 to 8. Multicollinearity is definitely present, but we need to examine whether it is a problem. If we'd like to reject the hypothesis that the true slope on region*tree_height is zero, then we are being hindered by the presence of the multicollinearity. If there weren't any multicollinearity, and all other things remained the same, then the test statistic for this hypothesis would be more like a significant 4.0 and not an insignificant 0.53... Multicollinearity may be keeping us from rejecting this hypothesis in Model 2.
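JMP reports the VIFs directly, but they are also easy to compute by hand. A sketch in Python, using the same hypothetical oaks.csv and column names as in answer 1, with the Model 2 terms:

    import numpy as np
    import pandas as pd
    import patsy
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Build the Model 2 design matrix: main effects plus the region interactions.
    oaks = pd.read_csv("oaks.csv")
    X = patsy.dmatrix("tree_height + np.log(acorn_size) + region"
                      " + region:tree_height + region:np.log(acorn_size)",
                      data=oaks, return_type="dataframe")

    # VIF of each term = 1/(1 - R-square from regressing it on the other terms).
    for i, name in enumerate(X.columns):
        if name != "Intercept":
            print(name, variance_inflation_factor(X.values, i))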

5) A species of tree that only grows on an island has a limited range. Our regression model doesn't know that this species lives on an island, and is likely to overpredict its log(range). This would give us a large negative residual. In the residual plot we see a point with a residual value of almost -4 (on the log scale!). This point is likely to be the species in question. It is.

6) Model 1 is nested within Model 2. Model 2 includes the interaction terms that we are interested in. An appropriate test here is what we called the subset F-test. The test statistic is ((38.06986 - 32.79789)/(35 - 33))/(32.79789/33) = 2.6522... Of the two individual tests for the slopes in question, only one is significant (at 0.0311); the other is entirely insignificant (this may be due to the aforementioned multicollinearity). The subset F-test may or may not be powerful enough to reject the hypothesis that the interactions region*log(acorn_size) and region*tree_height have no effect on log(range). We'd have to calculate a p-value or look in the appropriate table to be sure.
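That lookup is a one-liner; a sketch in Python using scipy, with the statistic and degrees of freedom from above:

    from scipy.stats import f

    # Upper-tail probability of an F(2, 33) beyond the observed statistic.
    print(f.sf(2.6522, 2, 33))  # roughly 0.085

At roughly 0.085, the subset F-test would not reject at the 5% level, though it would at the 10% level.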

7) Either the multicollinearity masked the real effect of log(acorn_size), or the real effect comes from the interaction region*log(acorn_size), and in the presence of that true effect, log(acorn_size) on its own no longer looks important.

8) From the abstract we see that the researchers' original hypothesis is that species with larger acorns should have larger ranges. This is a one-sided hypothesis. In Model 1 we can test the hypothesis that the true slope on log(acorn_size) (in the model which already has tree_height and region) is zero versus the alternative that the true slope is positive (note: the log(.) function is monotone increasing). The two-tailed p-value is 0.0247, so the one-tailed p-value is 0.0247/2 = 0.01235. For any significance level greater than 0.01235, the evidence suggests that the researchers' original hypothesis is correct. Note that one-sided p-values are not always half the two-tailed p-value: halving is correct only when the estimate falls on the side named by the alternative; had the estimated slope been negative here, the one-sided p-value would have been 1 - 0.0247/2.

9) JMP tells us that it has coded the Atlantic species with a 1 and the California species with a -1. For an Atlantic species the model is:

estimated log(range) = (6.595+1.375) + (0.044-0.016) tree_height + (0.237+0.473) log(acorn_size).

For a California species the model is:

estimated log(range) = (6.595-1.375) + (0.044+0.016) tree_height + (0.237-0.473) log(acorn_size).

We estimate the difference between the two regions (Atlantic minus California) to be:

2(1.375) - 2(0.016) tree_height + 2(0.473) log(acorn_size).
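Multiplying through, this comes to 2.75 - 0.032 tree_height + 0.946 log(acorn_size).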

10) The first species in our data set is simply the first observation, so we plug its predictor values into the fitted equation: estimated log(range) = 6.56 + 0.028(27) + 0.444 log(1.4) + 1.56(1). To get an estimate of the range itself, just exponentiate this value.
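The arithmetic, as a sketch in Python (assuming, as the estimate above does, natural logs):

    import math

    # First observation plugged into the fitted equation.
    log_range = 6.56 + 0.028 * 27 + 0.444 * math.log(1.4) + 1.56 * 1
    print(log_range, math.exp(log_range))  # estimated log(range), then range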

11) For a California oak:

estimated log(range) = (6.595-1.375) + (0.044+0.016) tree_height + (0.237-0.473) log(acorn_size).

12) To get a handle on leverage and influence we need to know at least the predictor values, if not the actual leverages, the DFBETAS, or Cook's D. None of our plots gives this information directly. Predictor values a long way from the center of the predictors have the potential for leverage and influence. High-leverage points with large residuals have influence. Our "big" residuals may have influence, but we don't really know, because we don't know whether those observations have high leverage.
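If one did have the data, these quantities fall out of the fitted model directly. A sketch in Python, reusing the hypothetical fit from answer 1:

    # 'fit' is the fitted Model 1 from the sketch in answer 1.
    infl = fit.get_influence()
    leverages = infl.hat_matrix_diag  # h_ii: distance from the predictor center
    cooks_d, _ = infl.cooks_distance  # overall influence of each observation
    dfbetas = infl.dfbetas            # influence on each coefficient separately

    # Influential points combine high leverage with a large residual.
    print(sorted(zip(cooks_d, oaks.index), reverse=True)[:5])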