Key to HW3: RWG 103 6-10
6. The regression equation with both dummy variables is:
DAMAGE = 37.284 - 78.897*DUMMY - 0.017*ELEV + 0.109*DUMMY*ELEV
This model has an adjusted R^2 of .52, meaning that this new model, which allows the slopes to differ between north and south sites, explains 16 percentage points more of the variance in the percent of damaged trees than the model that assumes a common slope (R^2=.36). Because the t-test shows us that the coefficient of the slope dummy variable (DUMMY*ELEV) is significantly different from 0 (P<.0001), we know that allowing the slopes of the lines to differ significantly improves the fit of the model.
Looking at the south sites only (DUMMY=0), the model reduces to:
DAMAGE = 37.284 - 0.017*ELEV
Note that the t-test for ELEV is not significant (P=.37), indicating that there is no significant effect of elevation on tree damage for sites in the south.
For sites in the north (DUMMY=1), the model reduces to:
DAMAGE = -41.338 + 0.0912*ELEV
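Because the slope dummy term (DUMMY*ELEV) was significant (by the t-test: P<.0001), we can say that the slope for the North sites differs significantly from the slope for the South sites, and that there is a 0.09% (0.09 percentage point) increase in tree damage for every 1 m of increase in elevation at sites in the North.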
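The reduction of the full model to one line per location is just arithmetic on the coefficients, and can be sketched as follows. This is a minimal illustration that uses the rounded coefficients printed in the full model from question 6; the helper name group_line is made up for this example. With the rounded coefficients the North line comes out close to, but not exactly equal to, the North equation printed above.

```python
# Coefficients of the full model as printed in the answer to question 6 (rounded).
b0 = 37.284        # intercept for South (DUMMY = 0)
b_dummy = -78.897  # shift in intercept for North (DUMMY = 1)
b_elev = -0.017    # slope on ELEV for South
b_inter = 0.109    # change in slope for North (DUMMY*ELEV)


def group_line(dummy):
    """Return (intercept, slope) of the DAMAGE-vs-ELEV line for one location.

    Hypothetical helper: plugs DUMMY = 0 or 1 into the full model.
    """
    intercept = b0 + b_dummy * dummy
    slope = b_elev + b_inter * dummy
    return intercept, slope


south = group_line(0)
north = group_line(1)
print("South: DAMAGE = %.3f %+.3f*ELEV" % south)
print("North: DAMAGE = %.3f %+.3f*ELEV" % north)
```

Setting DUMMY = 0 drops both dummy terms, so the South line is just the base intercept and base slope; setting DUMMY = 1 adds the two dummy coefficients to them.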
7. It's ok to plot the two graphs separately for this question, but it's a little more informative to plot the two lines on the same graph (because the scale of the graphs is the same, so it's easier to compare them). It's pretty obvious from this graph that there is a big difference between North and South sites. The North sites have an obvious positive slope and the South sites seem to have a negative slope.
Note that the individual linear regressions give the same results as the model from question 6 after filling in the dummy variables. The advantage of using a dummy variable approach is that you can use the F-tests to determine whether the lines are statistically different from each other (you can test hypotheses regarding differences in the slopes and intercepts of the lines). This is explained on page 91 of your text.
This plot also suggests why the slope from the South sites was not found to be significantly different from 0: (a) The negative slope is not very steep and (b) There are fewer data points for South sites, meaning that getting a significant t-test will be more difficult (remember that the d.f. for a t-test is determined by the number of data points).
8. The graph of the model from question 6 is the same as plotting the two simple regression lines on the same scatter plot (q. 7). Note that the variable DUMMY allows a difference in the intercept of the two lines (because we coded this as 0 vs. 1, its coefficient is the difference in predicted damage between the two locations at ELEV = 0; it is not the difference in mean damage, because the model also contains elevation terms). The variable DUMMY*ELEV allows the model to show a difference in slopes between the two locations.
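The claim that the full dummy-variable model reproduces the two separate regression lines can be checked numerically. The sketch below uses made-up data (NOT the homework data) and a small hand-written least-squares solver, so the helper names solve, ols, and fit_group are assumptions of this example rather than anything from JMP or the text.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up illustrative data: (DUMMY, ELEV, DAMAGE). These are NOT the homework data.
data = [
    (0, 400.0, 30.0), (0, 500.0, 25.0), (0, 650.0, 32.0), (0, 800.0, 20.0),
    (1, 600.0, 10.0), (1, 750.0, 35.0), (1, 900.0, 40.0),
    (1, 1000.0, 60.0), (1, 1150.0, 55.0),
]

# Full model: DAMAGE = b0 + b1*DUMMY + b2*ELEV + b3*DUMMY*ELEV
full = ols([[1.0, d, e, d * e] for d, e, _ in data], [y for _, _, y in data])


def fit_group(g):
    """Simple regression of DAMAGE on ELEV using only the sites with DUMMY == g."""
    sub = [(e, y) for d, e, y in data if d == g]
    return ols([[1.0, e] for e, _ in sub], [y for _, y in sub])


south = fit_group(0)
north = fit_group(1)

print("South (separate fit):   ", south)
print("South (from full model):", [full[0], full[2]])
print("North (separate fit):   ", north)
print("North (from full model):", [full[0] + full[1], full[2] + full[3]])
```

The two printed versions of each line should agree up to floating-point error, because the full model has a separate intercept and slope for each location and so fits each group exactly as a stand-alone regression would.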
9. Interpreting residual plots is a subjective exercise, a point which was made clear by differences in the opinions of the TAs about this question. There is certainly a degree of heteroscedasticity (RWG p. 53) apparent in the distribution of the residuals, which you should have mentioned. I felt that this trend was not very strong and mainly reflected a difference in sampling between the South sites (at the low end of predicted DAMAGE) and the North sites (at the higher DAMAGE levels). I would have given the overall model an "all clear" sign, with a caveat about differences in variance between the two locations. Others thought that the heteroscedasticity was pronounced enough to warrant concern.
Note that the specific problem of heteroscedasticity is that it is a violation of the assumption of constant variance (not the assumption of normally distributed error, as the residuals can still be normally distributed and not have constant variance). RWG p. 51 explains that this condition lowers the reliability of your SE's, statistical tests, and confidence intervals.
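One rough numerical companion to the visual check is to compare the spread of the residuals in each location. The sketch below assumes you have saved the residuals and split them by location; the numbers are made up for illustration and are NOT the homework residuals. It is only a descriptive summary, not a formal test.

```python
from statistics import variance

# Hypothetical residuals from a fitted model, split by location (made up for illustration).
resid_south = [-3.1, 2.4, 1.0, -0.3]
resid_north = [-12.5, 8.0, 15.2, -9.7, -1.0]

# Sample variance (n - 1 denominator) of the residuals in each location.
var_south = variance(resid_south)
var_north = variance(resid_north)

# A ratio far from 1 suggests the constant-variance assumption deserves a closer look.
ratio = max(var_south, var_north) / min(var_south, var_north)
print("South residual variance:", round(var_south, 2))
print("North residual variance:", round(var_north, 2))
print("Variance ratio:", round(ratio, 2))
```

How large a ratio counts as a problem is still a judgment call, which is consistent with the disagreement among graders described above.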
10. If you are rusty on what ANOVA is all about, check it out in a basic statistical text. It's one of the techniques that you will run across over and over again, so you will want to have a basic appreciation for what it does. For this question, the main point is that detecting the difference between two categories of a categorical variable using a dummy variable turns out to be the same as a one-way ANOVA. You should have gotten the following regression equation:
DAMAGE = 14.625 + 26.179*LOCATION
The slope of the regression of damage at South vs. North sites (with no elevation term) is significantly different from 0 (t-test P=.0007) and thus the mean of sites in the North is significantly different from that of the southern sites. The same result is seen in the ANOVA table from the one-way ANOVA.
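The equivalence between the dummy-variable regression and the one-way ANOVA can be sketched numerically. The data below are made up (NOT the homework data), and LOCATION is assumed to be coded 0 for South and 1 for North, as DUMMY was in question 6. Under that coding the intercept is the South mean and the slope is the North mean minus the South mean, so in the equation above 14.625 would be the South mean and 14.625 + 26.179 = 40.804 the North mean.

```python
from statistics import mean

# Made-up illustrative damage values. These are NOT the homework data.
south = [30.0, 25.0, 32.0, 20.0]        # LOCATION = 0
north = [10.0, 35.0, 40.0, 60.0, 55.0]  # LOCATION = 1

x = [0.0] * len(south) + [1.0] * len(north)
y = south + north
n = len(y)

# Simple regression of DAMAGE on the 0/1 dummy (closed-form least squares).
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = ybar - slope * xbar

# t statistic for the slope: estimate divided by its standard error.
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
t_stat = slope / (mse / sxx) ** 0.5

# One-way ANOVA F statistic for the same two groups (1 and n - 2 degrees of freedom).
groups = (south, north)
ss_between = sum(len(g) * (mean(g) - ybar) ** 2 for g in groups)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
f_stat = (ss_between / 1) / (ss_within / (n - 2))

print("intercept:", intercept, " South mean:", mean(south))
print("slope:", slope, " North mean - South mean:", mean(north) - mean(south))
print("t^2:", t_stat ** 2, " F:", f_stat)
```

With two groups the ANOVA F statistic equals the square of the regression t statistic, which is why the two analyses give the same p-value.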
In the words of one student, "The advantage of using a dummy-variable regression is that you get some other useful information, like the regression coefficients, which give us a whole lot of information... It has a lot of extra information like standard errors and t-statistics that can help in analyzing the problem." Another advantage of this approach is that you can understand its implications if you have a basic understanding of regression analysis. I.e., you don't have to take another class just to learn ANOVA!
11. The F-tests for individual model coefficients are the Effect Tests, which are shown in a box of your regression output and individually below the leverage plots for each variable. (Note that they have the same p-values as the t-tests.) You should get the following results:
Ho: LOCATION = 0; P=.0072
Ho: ELEVATION = 0; P = .3724
Ho: LOC*ELEV = 0; P<.0001
At an alpha level of 0.05, you would reject the null hypothesis for LOC and LOC*ELEV, indicating that there is a difference in both the intercepts and slopes of the regression lines for North vs. South sites. The null hypothesis for ELEVATION cannot be rejected at any reasonable alpha level, so we note that the relationship between elevation and tree damage is not significant at South sites (as we saw in q.6).
You can think of ANCOVA as ANOVA with a continuous covariate added (here, elevation). Typically, the results you would get from an ANCOVA are the "Effect Test" box that JMP gives you at the bottom of your ANCOVA results. It simply provides the F-tests for each of the variables in the model. Note that these are the same as the values from the individual F-tests you get from the dummy-variable regression.
An interesting point about the difference between dummy-variable regressions and ANCOVA was made by one student who said, "The advantage of (dummy) regression over ANCOVA is the way we can interpret our results. In our regression, we have more of an idea of the meaning of each null hypothesis. The model is more 'connected' to the science. This is especially the case if we have interactions in our model, since in ANCOVA the highest order interaction 'masks' the main effect (in our case elevation)."