Homework 3
Part 1. LACK OF FIT
The linear regression predicting the log of catfish
survival from the log of chemical concentration is:
LOGSURV = 7.491 - 0.998 LOGCONC
The model appears to fit the data very well with an
R-squared value of 0.98 and an overall F-statistic of
667.73 which leads to a p-value of 0.001, indicating
that we can reject the null hypothesis (Beta(1) = 0) at
p < 0.05. The residuals appear to be iid from the
residual graph and they appear to be normally
distributed from the Quantile-Normal plot.
In order to compute the ANOVA table for the Lack of Fit
test for this model: we get the Pure Error SS using the
SAS code in the Lack of Fit web page. We subtract this
Pure Error SS from the Residual SS in the ANOVA table
from the regression to get the Lack of Fit SS.
Dividing by the relevant df, we obtain the following
ANOVA table:
| Source      | df | SS     | MS     | F     |
| Residual    | 16 | 0.0501 |        |       |
| Lack of Fit |  4 | 0.0025 | 0.0006 | 0.158 |
| Pure Error  | 12 | 0.0476 | 0.004  |       |
Comparing F = 0.158 at df 4,12 to the value at alpha
of 0.05 in Table A4.2, we find 0.158 << 3.26, so we
find no significant lack of fit in the linear model.
(This agrees with the impression you would get from
just looking at the data!)
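The arithmetic behind the Lack of Fit table can be checked with a short sketch. This is only an illustration in Python, not the SAS code from the course web page; the variable names are made up, and the critical value 3.26 is copied from Table A4.2 as quoted above rather than computed.

```python
# Lack-of-fit F-test rebuilt from the sums of squares reported above.
ss_residual = 0.0501    # Residual SS from the regression ANOVA table
ss_pure_error = 0.0476  # Pure Error SS
df_residual = 16
df_pure_error = 12

# Lack of Fit is whatever residual variation is not pure error.
ss_lack_of_fit = ss_residual - ss_pure_error
df_lack_of_fit = df_residual - df_pure_error

ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error
f_stat = ms_lack_of_fit / ms_pure_error

F_CRITICAL = 3.26  # alpha = 0.05, df = 4,12, as quoted from Table A4.2
lack_of_fit_is_significant = f_stat > F_CRITICAL

print(f"SS(LOF) = {ss_lack_of_fit:.4f}, df = {df_lack_of_fit}")
print(f"F = {f_stat:.3f}, significant lack of fit: {lack_of_fit_is_significant}")
```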
Part 2. RWG CHAPTER 3: 1-9
(1) The multi-variable regression of percent DAMAGED
trees predicted from the dummy variable LOCATION and
site ELEVATION gives the following regression equation:
DAM = -60.153 + 49.78 LOC + 0.0568 EL
b(0) = -60.153 is the Y-intercept for south sites (LOC = 0)
b(0) + b(1) = -10.373 is the Y-intercept for north sites (LOC = 1)
b(2) = 0.0568 is the slope of both lines
The values for the Y-intercept are obviously meaningless (-60% of spruce trees are damaged?!). This is because the Y-intercept (interpreted as % damage at sea level) is far outside the range of the data (the lowest elevation is 670 m). This shows the danger of extrapolating beyond the data. In fact, over most of the area studied, red spruce doesn't occur at sea level!
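As a quick check on how the dummy variable shifts the intercept, the equation from question 1 can be written as a small function. This is a sketch only; the function and constant names are hypothetical, and the coefficients are the ones reported above.

```python
# Coefficients of DAM = b0 + b1*LOC + b2*EL, as reported in question 1.
B0, B1, B2 = -60.153, 49.78, 0.0568

def predicted_damage(loc, elevation_m):
    """Predicted percent damage; loc is 0 for south sites, 1 for north sites."""
    return B0 + B1 * loc + B2 * elevation_m

# Setting EL = 0 isolates the Y-intercept of each line.
south_intercept = predicted_damage(0, 0)
north_intercept = predicted_damage(1, 0)

# The slope does not depend on LOC, so the two lines are parallel.
slope_south = predicted_damage(0, 1) - predicted_damage(0, 0)
slope_north = predicted_damage(1, 1) - predicted_damage(1, 0)

print(f"south intercept = {south_intercept:.3f}")
print(f"north intercept = {north_intercept:.3f}")
```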
The R^2 adj. for the regression is .36 suggesting that only 36% of the variance in spruce damage is explained by the model. Note that, as we expect, the unadjusted R-squared value is higher, because it doesn't account for the effects of added variables.
(2) The first three null hypotheses are tested by t-tests in the Parameter Estimate table in the regression output:
| Null hypothesis | Prob > |T| |
| Beta(0) = 0     | 0.0011     |
| Beta(1) = 0     | 0.0001     |
| Beta(2) = 0     | 0.0001     |
All three hypotheses can be rejected at the p = 0.05 level. We interpret the rejection of Beta(1) = 0 as follows:
The difference in Y-intercept between southern and northern sites is significant.
The final null hypothesis (Beta(1) = Beta(2) = 0) is addressed by the overall F-test in the regression output and can be rejected at the p < 0.05 level (Prob > F: 0.0001). This means that the regression equation provides significantly more information than the mean red spruce damage percentage.
(3a) The easiest way to get confidence intervals for the regression coefficients is in the regression output window under Tables (but you should also be able to derive estimates using the formulas on page 78 of RWG!).
95% CI:
33.29 < LOC < 66.27
0.03 < EL < 0.08
99% CI:
27.86 < LOC < 71.70
0.02 < EL < 0.09
(3b) We discussed three ways to get the standard error of the mean for a predicted Y-value:
(a) using the PROC REG function in regular SAS (from SE of the mean web page)
(b) adding the variance and covariance of each b(i), taken from the variance/covariance matrix (from class 2/12)
(c) multiplying the row vector of X values (C = [1 1 1500]) by the var/covar matrix, multiplying the result by C' (which is just C written as a vertical column), and taking the square root (from the homework web page and eqn. 3.26)
You will get a different value for SE(Y-hat) depending on which approach you use because the variance/covariance table given by SAS Insight rounds off to four decimal places.
The SE(Y-hat) determined using approach (a) is ~7.82
The SE(Y-hat) determined using approaches (b) and (c) is ~12.53
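A rough sketch of why four-decimal rounding can matter this much: in the product C V C', each entry of the var/covar matrix is weighted by a product of two X values, and with EL = 1500 some of those weights are in the millions. The code below assumes every matrix entry is off by at most half a unit in the fourth decimal place (0.00005) and bounds the resulting error in the variance of Y-hat. It is a worst-case bound, not a reconstruction of the SAS output.

```python
# Row vector of X values for a north site at 1500 m: intercept, LOC, EL.
C = [1, 1, 1500]

# Assumption: rounding to four decimals changes each entry by at most 0.00005.
MAX_ROUNDING_ERROR = 0.00005

# var(Y-hat) = C V C' = sum over i, j of C[i] * C[j] * V[i][j],
# so an error in V[i][j] is magnified by the weight C[i] * C[j].
weights = [[ci * cj for cj in C] for ci in C]
largest_weight = max(max(row) for row in weights)
worst_case_variance_error = MAX_ROUNDING_ERROR * sum(
    abs(w) for row in weights for w in row
)

# Gap between the two reported answers, expressed as variances.
se_proc_reg = 7.82   # approach (a)
se_matrix = 12.53    # approaches (b) and (c)
observed_variance_gap = se_matrix ** 2 - se_proc_reg ** 2

print(f"largest weight on a matrix entry: {largest_weight}")
print(f"worst-case error in var(Y-hat):   {worst_case_variance_error:.1f}")
print(f"observed gap in var(Y-hat):       {observed_variance_gap:.1f}")
```

Under that assumption the gap between the two reported values falls inside the worst-case bound, which is consistent with rounding being the cause.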
Y-hat = -10.373 + 0.0568(1500) = 74.83
df = n-k = 62, so use the t-table values for ~60 df
95% CI: 74.83 +/- 2.00(7.8 or 12.5)
99% CI: 74.83 +/- 2.66(7.8 or 12.5)
The confidence intervals can therefore be quite different depending on which approach you used. This illustrates the danger of rounding errors. In interpreting these confidence intervals, note that EL = 1500 is outside of the range of data for northern sites, so we might have the same problem we had in question 1.
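The interval arithmetic above can be sketched as follows, using the two SE values and the t-table values (2.00 and 2.66) quoted in this answer. The helper name is hypothetical.

```python
# Predicted damage for a north site (LOC = 1) at EL = 1500 m.
y_hat = -10.373 + 0.0568 * 1500

# t-table values for ~60 df, as used above.
T_95 = 2.00
T_99 = 2.66

def confidence_interval(center, t_value, standard_error):
    """Return (lower, upper) for center +/- t * SE."""
    half_width = t_value * standard_error
    return center - half_width, center + half_width

# SE(Y-hat) from approach (a) and from approaches (b)/(c).
for label, se in (("approach (a)", 7.82), ("approaches (b), (c)", 12.53)):
    lo95, hi95 = confidence_interval(y_hat, T_95, se)
    lo99, hi99 = confidence_interval(y_hat, T_99, se)
    print(f"{label}: 95% CI {lo95:.2f} to {hi95:.2f}, "
          f"99% CI {lo99:.2f} to {hi99:.2f}")
```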
(4) Dummy intercept variables allow separate regression lines to be drawn for each of the two locations. If you plotted the southern and northern sites as different symbols in your scatterplot, you should note that the two lines this regression equation gives cut through the middle of the cloud of points from each type of stand. Also note that the lines are parallel. The distance between them is the difference between the average values of sites at the two locations.
(5) The southern sites seem to have a weakly negative relationship between elevation and percent damage, while the northern sites appear to have a strongly positive relationship.
(6) The regression model including slope and intercept dummy variables is:
DAM = 37.28 - 78.62(LOC) - 0.017(EL) + 0.108(LOCEL)
The adjusted R-squared for this model is .52, a marked improvement over the fit of the earlier model (R^2 adj = .36). A partial F-test reveals that LOC and LOCEL together contribute significantly to the model:
F = ((27809-12801)/2)/(12801/60) = 35.17 [by RWG eqn 3.28]
df = 2,60
35.17 >> 4.98 [from F-table]
Therefore, reject the null hypothesis that Beta(1) = Beta(3) = 0. The model is improved by including the two dummy variables.
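The partial F-test can be reproduced with a few lines. This sketch assumes, as the formula above implies, that 27809 is the residual SS of the reduced model (without LOC and LOCEL) and 12801 is the residual SS of the full model; the critical value 4.98 is the F-table value quoted above.

```python
# Residual sums of squares, as used in the partial F-test above.
ss_res_reduced = 27809  # assumed: model without LOC and LOCEL
ss_res_full = 12801     # assumed: model with LOC, EL and LOCEL

num_df = 2   # number of coefficients tested (Beta(1) and Beta(3))
den_df = 60  # residual df of the full model

# Extra SS explained per tested coefficient, over the full-model MSE.
ms_extra = (ss_res_reduced - ss_res_full) / num_df
ms_error_full = ss_res_full / den_df
partial_f = ms_extra / ms_error_full

F_CRITICAL = 4.98  # from the F-table, df = 2,60, as quoted above
reject_null = partial_f > F_CRITICAL

print(f"partial F = {partial_f:.2f}, reject Beta(1) = Beta(3) = 0: {reject_null}")
```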
The regression equations for the two locations are:
DAMAGE AT SOUTHERN SITES = 37.28 - 0.0172(EL)
DAMAGE AT NORTHERN SITES = -41.34 + 0.09(EL)
As we suspected in question 5, the northern sites have a strongly positive slope. Southern locations appear to have a weakly negative slope, but see below for more information...
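The two location-specific lines follow directly from the single dummy-variable model: setting LOC = 0 drops both dummy terms, and setting LOC = 1 adds the LOC coefficient to the intercept and the LOCEL coefficient to the slope. A minimal sketch, with hypothetical names and the coefficients from question 6:

```python
# Coefficients of DAM = b0 + b_loc*LOC + b_el*EL + b_locel*LOCEL,
# where LOCEL = LOC * EL.
B0, B_LOC, B_EL, B_LOCEL = 37.28, -78.62, -0.017, 0.108

def line_for_location(loc):
    """Return (intercept, slope) of the damage-vs-elevation line.

    loc is 0 for southern sites and 1 for northern sites.
    """
    intercept = B0 + B_LOC * loc
    slope = B_EL + B_LOCEL * loc
    return intercept, slope

south_intercept, south_slope = line_for_location(0)
north_intercept, north_slope = line_for_location(1)

print(f"south: DAMAGE = {south_intercept:.2f} + ({south_slope:.3f})(EL)")
print(f"north: DAMAGE = {north_intercept:.2f} + ({north_slope:.3f})(EL)")

# Change in predicted percent damage per 100 m at northern sites.
north_change_per_100m = north_slope * 100
```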
(7) Running separate regressions on the northern and southern sites gives us the same equations as the model with the dummy variables. Note that the slope parameter is significant for the north sites, but the slope parameter for the south sites is NOT significant. Thus, the weak negative slope from question 6 is not significantly different from 0: damage at southern sites is not significantly influenced by elevation.
The advantage of having a single model that includes both locations is that we can statistically test differences in the slopes and intercepts of the two lines and therefore get a better sense of how they differ. For example, the coefficient on the LOCEL variable is significant (by a t-test), which indicates that the difference in slopes between the southern and northern sites is significant. See page 89 in RWG for interpretations of the 2 slope parameters.
(8) Unlike the lines we drew in question 4, these lines make the difference between northern and southern sites in the relationship between elevation and percent damage clear (i.e., the northern sites have a different slope and intercept than the southern sites: see equations in question 6).
(9) The residuals appear to be iid with constant variance from the residual plot, although it seems that the southern sites have a little less variance around the line than the northern sites.
OVERALL INTERPRETATION:
The model with the two dummy variables seems to fit the data very well (high adj. R^2, residuals fit model assumptions, model and dummy variables are significant by F-tests). Interpreting the model as it applies to the northern sites: red spruce at higher elevation sites in the north appear to suffer more damage. Specifically, over the elevation range we have data for, we would expect the percentage of damaged spruce trees to rise by about 9 percentage points for every 100 m increase in elevation. In the south, elevation does not significantly affect the percent of damaged spruce trees (recall that the slope for the linear regression on southern sites in question 7 was not significant). Overall, northern locations had the relationship between elevation and percent damage we expected at the outset (recall that deposition of pollutants is heavier at high elevations), while southern sites did not show this trend. We cannot attribute the CAUSE of spruce decline to pollution, even at northern sites, however, because other factors we haven't considered (such as winter frost damage, etc.) could be influencing the relationship between elevation and the percent of damaged spruce trees.