ANSWERS - LAB 4
Chapter 4
1) In the regression equation, the y-intercept is -7642, which implies that if year and area were both set to 0, production would be -7,642,000 metric tons (which is impossible, since production cannot be negative). The regression equation indicates that for each additional year, holding area constant, production is expected to rise by 4214 metric tons. For each one-unit increase in cultivated land area (in the units recorded in the data set), holding year constant, production is expected to rise by 564.7 metric tons. The adjusted R2 value says that 31.13% of the variation in production can be explained by changes in area and year. Considering the t tests, we see that none of the parameter estimates are significant. If the outlier, case 17, is removed, the regression equation becomes Production = -11279 + 6.0346 Year + 0.797 Area. This improves the explained variance (adjusted R2 = 0.828). Again, none of the individual parameters are significantly different from 0.
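As a quick check on the unit conversions in this answer, the fitted equation can be evaluated directly. This is only a sketch: it assumes production is recorded in thousands of metric tons (inferred from the answer converting -7642 to -7642000 tons), and the years and the area value used below are hypothetical inputs, not values from the lab data.

```python
# Coefficients of the full-sample fit quoted in the answer.
# Assumption: production is measured in thousands of metric tons.
INTERCEPT, B_YEAR, B_AREA = -7642.0, 4.214, 0.5647


def predict_production(year: float, area: float) -> float:
    """Predicted production (thousands of metric tons) from the fitted equation."""
    return INTERCEPT + B_YEAR * year + B_AREA * area


area = 1000.0  # hypothetical area value, for illustration only

# Effect of one more year with area held constant.
delta_year = predict_production(1971, area) - predict_production(1970, area)

# Effect of one more unit of area with year held constant.
delta_area = predict_production(1970, area + 1) - predict_production(1970, area)

print(f"one more year: {delta_year * 1000:.0f} metric tons")
print(f"one more unit of area: {delta_area * 1000:.1f} metric tons")
```

Each difference equals the corresponding slope, which is why the coefficients can be read as the expected change per unit with the other predictor held fixed.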
2) In the expected versus predicted plot, observation 17 (year 1982) is a potential influential case. In addition, the scatter might suggest a curvilinear relationship. Another way of interpreting the scatter would be to say that the variance increases as predicted production increases (heteroscedasticity).
3) Studentized Residuals: these test whether case i causes a significant shift in the regression intercept and so should be considered an outlier. Potential outliers can be identified using the Bonferroni inequality, comparing each p value with α/n to determine whether it is significant:
P value < 0.05 / 17 = 0.00294
From the output, one p value is below 0.00294, corresponding to case 17, which is a significant outlier.
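The Bonferroni cut-off above is simple arithmetic, sketched here with alpha = 0.05 and n = 17 as stated in the answer. The function name is just an illustrative choice.

```python
def bonferroni_cutoff(alpha: float, n: int) -> float:
    """Per-case significance level when n studentized residuals are tested."""
    return alpha / n


cutoff = bonferroni_cutoff(alpha=0.05, n=17)
print(f"{cutoff:.5f}")  # 0.00294
```

A case is flagged as an outlier only if the p value of its studentized residual falls below this cut-off.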
Leverage: This is the potential for influence resulting from unusual X values. Any leverage between 0.2 and 0.5 is considered risky, since too much of the sample's information about the relationship between Y and X comes from a single case. Therefore, data for 1961, 1965, 1974 and 1980 are risky, and 1981 should be avoided since its value is > 0.5.
Cook's distance: This measures the influence on the model as a whole, rather than on a specific coefficient. It measures the change in predicted cassava production when the ith case is deleted. Cook's D identifies influential points if Di > 1. From the output, no case has Di > 1, so there are no influential cases. However, if the size-adjusted criterion Di > 4/n is used, case 17 is influential. It is unlikely, though, that the size-adjusted Cook's distance would be used for such a small sample size.
DFBETAS: These measure the influence of the ith case on the kth regression coefficient. If |DFBETAS| > 2, or |DFBETAS| > 2 / sq. root of n for the size-adjusted criterion, the ith case is a problem. In the output, no case is influential on any regression coefficient at |DFBETAS| > 2, but if the size-adjusted criterion |DFBETAS| > 2 / sq. root of n is used, case 16 is influential.
Partial Leverage Plot: Case 17 is an outlier but doesn't have strong influence on slopes since the case statistics for #17 are not significant.
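The leverage rule of thumb used in this answer (0.2 to 0.5 is risky, above 0.5 should be avoided) can be sketched as a small helper. The leverage values below are hypothetical placeholders, not the lab output, and the treatment of the exact boundary values is an assumption.

```python
def classify_leverage(h: float) -> str:
    """Classify a leverage value with the rule of thumb quoted in the answer."""
    if h > 0.5:
        return "avoid"
    if h >= 0.2:
        return "risky"
    return "ok"


# Hypothetical leverage values, for illustration only.
hypothetical_leverage = {"case A": 0.12, "case B": 0.31, "case C": 0.56}

for label, h in hypothetical_leverage.items():
    print(label, h, classify_leverage(h))
```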
Part 2: Regress production on year and production on area. (You can exclude observation 17.) The best model comes from the regression of production on area, excluding observation 17. The model equation is Production = 387.75 + 1.6738 Area, with R2 = 0.824. This model did not show any other observations as influential cases when looking at DFBETAS and Cook's D. The implication of this model for forecasting is that an accurate prediction does not depend on the year, but on the area alone.
Ranger Rick: We took data on cassava production in central Africa to find out whether any factors influence how much is produced in a given year. Performing a regression on the data, we obtain the equation Production = -7642 + 4.214 Year + 0.5647 Area. The intercept tells us about the background level; in this case it is negative, which is impossible for production. The adjusted R2 value tells us that area and year together explain 31.13% of the variation in production. Since none of the parameter estimates are significantly different from 0, none of the factors alone can be used to predict cassava production; they have to be used in combination with one another. Also mention F test.
13) The idea here is to examine the case statistics to identify potential outliers. The instructions say to use size-adjusted cut-offs.
For Cook's D, cutoff is 4/n = 4/10 = 0.4
For DFBETAS, cutoff is 2 / sq. root (n) = 0.63
For leverage, cutoff is 2K / n = 0.4
The only potential influential point is # 2, Biver. Although it is below the size-adjusted cutoff for leverage, it exceeds the size-adjusted cutoffs for Cook's D and for both DFBETAS.
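The three size-adjusted cut-offs listed for this question can be reproduced with a few lines of arithmetic. In this sketch n = 10 comes from the answer, while K = 2 is inferred from 2K / n = 0.4 and is assumed to count the estimated parameters (intercept and slope). The function name is an illustrative choice.

```python
import math


def size_adjusted_cutoffs(n: int, k: int) -> dict:
    """Size-adjusted cut-offs for Cook's D, DFBETAS and leverage."""
    return {
        "cooks_d": 4 / n,
        "dfbetas": 2 / math.sqrt(n),
        "leverage": 2 * k / n,
    }


# n = 10 cases; k = 2 is inferred from the stated leverage cut-off of 0.4.
cutoffs = size_adjusted_cutoffs(n=10, k=2)
for name, value in cutoffs.items():
    print(f"{name}: {value:.2f}")
```

The same formulas with n = 17 give roughly 0.235 for Cook's D and 0.485 for DFBETAS, which are the size-adjusted criteria referred to in the cassava answer.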
14)
The key point here is that eliminating observation # 2 does not change your conclusions. The slope is still positive and statistically significant according to the t test. R2 also increases, but the basic conclusion remains the same: when SO2 increases, pulmonary symptoms in children increase.
Ranger Rick
What is the easiest way to explain our conclusions? Lay people will not understand F statistics, R2 or the slope. The lay person will understand this:
When the mean 2-week SO2 concentration increases by 10 micrograms per cubic meter, about 13 more children out of every 1000 (18 out of 1000 if you exclude observation # 2) will experience pulmonary symptoms.