Homework 6
1. The full regression model with all variables and all cases included is:
MORT = 1764.98 + 1.91*PREC - 1.94*JANT - 3.10*JULT - 9.07*OVR65 - 106.83*POPN - 17.16*EDUC - 0.65*HOUS + 0.003*DENS + 4.46*NONW - 0.19*WWDRK - 0.168*POOR - 0.67*HC + 1.34*NOX + 0.086*SO + 0.11*HUMID
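The fitted equation can be encoded directly as a prediction function. This is just a sketch: the coefficients are taken from the model above, and any input values you feed it are purely illustrative, not real cities.

```python
# Sketch: the full-model OLS equation as a Python prediction function.
# Coefficients are copied from the fitted model above.

COEFS = {
    "PREC": 1.91, "JANT": -1.94, "JULT": -3.10, "OVR65": -9.07,
    "POPN": -106.83, "EDUC": -17.16, "HOUS": -0.65, "DENS": 0.003,
    "NONW": 4.46, "WWDRK": -0.19, "POOR": -0.168, "HC": -0.67,
    "NOX": 1.34, "SO": 0.086, "HUMID": 0.11,
}
INTERCEPT = 1764.98

def predict_mort(x):
    """Predicted age-adjusted mortality for a dict of predictor values."""
    return INTERCEPT + sum(COEFS[name] * x[name] for name in COEFS)
```

With all predictors set to zero the function simply returns the intercept, 1764.98, which is a quick sanity check that the coefficients were transcribed correctly.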
a) Residual plots. The residual plot shows that the residuals appear normally distributed and fairly homoscedastic, and the relationship between the independent and dependent variables appears linear, with no obvious influential cases. In short, "all clear". The quantile-normal plot also supports the OLS assumption of normally distributed residuals.
b) Case statistics. After examining the case statistics, I decided to leave all the data points in the model. This is a judgment call, and many people decided to remove some points; as long as you provide a good reason, it's OK to remove a few. Also, with 60 data points you can use either the absolute or the size-adjusted cutoffs for your case statistics, but be consistent between tests! Several cases had high leverage (hat values > .5), indicating potential influence on the model. While the studentized residuals flagged a few points as outliers at the alpha = .05 level, no Cook's D values exceeded 1, indicating that no points strongly influence the predicted values of the model. No DFBETAS values were near the 2-standard-deviation cutoff indicating influence on a specific coefficient. Note that a few cases (e.g., 48) stand out as having relatively high DFBETAS values for all the pollution variables; such information may be useful in further analyses.
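The size-adjusted cutoffs mentioned above are quick to compute. The formulas below are common rules of thumb (2*(k+1)/n for leverage, 4/n for Cook's D, 2/sqrt(n) for DFBETAS); check them against the ones given in class, since conventions vary.

```python
# Size-adjusted case-statistic cutoffs for this model:
# n = 60 cases and k = 15 predictors (values from the homework).
from math import sqrt

n, k = 60, 15

hat_cutoff = 2 * (k + 1) / n      # leverage (hat values)
cooks_cutoff = 4 / n              # Cook's D
dfbetas_cutoff = 2 / sqrt(n)      # DFBETAS

print(round(hat_cutoff, 2))       # 0.53
print(round(cooks_cutoff, 2))     # 0.07
print(round(dfbetas_cutoff, 2))   # 0.26
```

Note that with n = 60 and k = 15 the size-adjusted leverage cutoff (0.53) happens to sit very close to the absolute cutoff of .5 used above, so the two conventions flag nearly the same cases here.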
c) Pollution variables. The first indication that there may be problems associated with the pollution variables is that HC has a negative coefficient. Taken literally, we should expect a DECLINE in age-adjusted mortality of 0.67 (deaths per 100,000) for every unit increase in relative hydrocarbon pollution potential. We may not know exactly what "relative hydrocarbon pollution potential" is, but we can be sure that it isn't supposed to be GOOD for you! NOX and SO both have positive coefficients, indicating a slight increase in mortality with increases in these pollutants. However, the t-tests are not significant for any of the 3 pollutants, indicating that no individual pollution variable has a significant effect, given the presence of all other variables in the model.
Scatter plots of the three pollution variables show strong collinearity between these variables. This multicollinearity could explain the odd coefficient on HC (remember that the high SEs caused by multicollinearity result in imprecise coefficient estimates that are sensitive to the individual sample). Other indications of multicollinearity among the pollution variables are their low tolerance values and the high correlations in the correlation matrix. We will also see that the SEs of these variables change a lot as pollution variables are removed from or added to the model in the variable selection process below. Note that multicollinearity does not INVALIDATE the t or F tests in the model or bias the SE or coefficient estimates (see table 4.1). It DOES cause UNCERTAINTY about the coefficient estimates, and the SEs of the variables will be affected by the presence or absence of collinear variables in the model.
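Tolerance itself is easy to compute by hand: regress each predictor on the other predictors and take 1 - R^2. The sketch below uses synthetic data that mimics the collinearity pattern (the variable names are illustrative stand-ins, not the homework data set).

```python
# Tolerance = 1 - R^2 from regressing one predictor on the others.
# Synthetic data: "nox" is built to be strongly collinear with "hc",
# while "so" is independent of both.
import numpy as np

rng = np.random.default_rng(0)
n = 60
hc = rng.normal(size=n)
nox = hc + rng.normal(scale=0.1, size=n)   # nearly a copy of hc
so = rng.normal(size=n)                     # unrelated noise

X = np.column_stack([hc, nox, so])

def tolerance(X, j):
    """Tolerance of column j: 1 - R^2 from regressing it on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 - r2

# tolerance(X, 0) comes out very small (hc ~ nox), flagging
# multicollinearity; tolerance(X, 2) stays near 1 (so is independent).
```

A tolerance near 0 means almost all of that variable's variation is already carried by the other predictors, which is exactly the situation for HC and NOX in the homework model.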
2. Partial F-test. We can test whether the 3 pollution variables significantly improve the model (Ho: coefficients for HC = NOX = SO = 0) by performing a partial F-test (RWG p. 80):

F = [(RSS{K-H} - RSS{K}) / H] / [RSS{K} / (n - K)]
where,
K=15 parameters (full model)
H=3 parameters (the number you are taking out)
n=60 datapoints (or however many you had)
Get RSS{K-H} by running a regression on all variables EXCEPT the 3 pollution variables.
We get:
F = [(63288 - 53680) / 3] / [53680 / (60 - 15)] = 3202.67 / 1192.89 = 2.68
The critical value of F(3,45) at alpha = 0.05 is about 2.81 (the nearest standard table entry, F(3,40), is 2.84).
Therefore, we do not reject the null hypothesis that HC=NOX=SO=0. The three pollution variables do not significantly improve the model.
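The arithmetic above can be checked numerically; the RSS values (63288 restricted, 53680 full) are the ones reported in the answer, and scipy supplies the exact critical value rather than a table lookup.

```python
# Partial F-test for Ho: coefficients for HC = NOX = SO = 0.
from scipy.stats import f

rss_restricted = 63288   # RSS{K-H}: model WITHOUT the 3 pollution variables
rss_full = 53680         # RSS{K}:   full model
H, n, K = 3, 60, 15      # variables removed, cases, full-model parameters

F = ((rss_restricted - rss_full) / H) / (rss_full / (n - K))
crit = f.ppf(0.95, H, n - K)   # exact F(3, 45) critical value, alpha = .05

print(round(F, 2))    # 2.68
print(F < crit)       # True: fail to reject Ho
```

Since F = 2.68 falls below the critical value, the numerical check agrees with the conclusion above: the three pollution variables do not significantly improve the model at alpha = 0.05.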
NOTE: A lot of people used an alpha level of .1 for this test and therefore rejected the null hypothesis. There are two reasons why people might have used an unusually high alpha level; one is very bad, the other only technically incorrect. It would be very bad if you wanted the partial F-test to show that the pollutant coefficients were significantly different from 0 and simply picked an alpha level high enough to make the test come out significant. That is not good statistics: although alpha levels are somewhat arbitrary, you need to pick one before the test and then see how the test measures up to it. It is only technically incorrect if you picked alpha = .1 to match the default levels suggested for the variable selection procedures in part 4. The reason this is technically incorrect is that the variable selection procedures use multiple tests, so we don't really know their overall alpha level, and the two results aren't truly comparable in the end. In any case, if you use an unusually high alpha level in a test, you should explain why. Here, it would also be useful to note that the conclusion changes under the more traditional alpha = 0.05.
3. Backwards elimination. Using an alpha level of 0.05, the manual backwards selection procedure resulted in the model MORT = JANT + EDUC + NONW + HC + NOX. Tolerance values for HC and NOX are still low, and partial leverage plots for these variables show the classic pattern of multicollinearity. The coefficients on HC and NOX in this model are -0.98 and 1.99, respectively, illustrating that the strong collinearity between these variables is still affecting their coefficient estimates (the HC coefficient is still negative). This multicollinearity reduces our ability to generalize beyond our sample about the effects of HC and NOX on mortality.
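The manual procedure can be sketched as a simple loop: fit the model, drop the least significant predictor, and refit until every remaining predictor is significant at alpha. The data and variable names below are synthetic stand-ins, not the homework data set.

```python
# Sketch of manual backwards elimination via OLS t-tests.
import numpy as np
from scipy.stats import t as t_dist

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the predictor with the largest p-value until all p < alpha."""
    X = X.copy()
    names = list(names)
    while names:
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        df = len(y) - A.shape[1]
        s2 = resid @ resid / df
        se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
        pvals = 2 * t_dist.sf(np.abs(beta / se), df)  # two-sided t-tests
        worst = int(np.argmax(pvals[1:]))             # skip the intercept
        if pvals[1 + worst] < alpha:
            break          # every remaining predictor is significant
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic demo: y depends on x0 and x1 but not on x2.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)
kept = backward_eliminate(X, y, ["x0", "x1", "x2"])
print(kept)   # x2 carries no signal, so it is usually dropped
```

Because the loop re-tests after each drop, a variable's p-value can change as collinear companions leave the model, which is exactly why the coefficients and SEs of HC and NOX move around during selection.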
4. Comparing variable selection models. Using the significance levels on the homework web page, the following models were selected as the "best" model for the data set:
Forward selection (slentry=.1):
MORT = PREC + JANT + JULT + EDUC + NONW + SO
R^2(adj) = .70
coef for SO = .26
tolerance for SO = .78
Stepwise selection (slentry=.1, slstay=.05):
MORT = PREC + JANT + EDUC + NONW + SO
R^2(adj) = .69
coef for SO = .28
tolerance for SO = .81
Backward selection (slstay=.1):
MORT = JANT + JULT + POPN + EDUC + NONW + HC + NOX
R^2(adj) = .70
coef for HC = -0.95; coef for NOX = 1.77
tolerance for HC = .02; tolerance for NOX = .02
Adj. R^2 selection:
MORT = PREC + JANT + JULT + OVR65 + POPN + EDUC + DENS + NONW + HC + NOX
R^2(adj) = .71
coef for HC = -0.88; coef for NOX = 1.77
tolerance for HC = .02; tolerance for NOX = .02
Clearly, we need to be careful when using automated selection processes such as these, as we got a different answer for each apparently sensible way of selecting model variables. With respect to the pollution variables, the forward and stepwise selection procedures included SO, while the other two substituted HC and NOX. Since we know that all three variables are collinear, we know that the differences between "best" models do not really mean that one pollutant is much more significant than another. Instead, the pollutant variables explain the same patterns in the data so selecting one accounts for the variability that would be explained by another.
Interestingly, each "best" model includes at least one pollutant variable, suggesting that pollutants are important for explaining mortality levels in US cities. Note that this contradicts the results of the partial F-test in part 2, which showed that the pollution variables did not significantly help explain mortality. There are two possible explanations for this apparent contradiction. First, the pollution variables may be collinear with other variables in the model; the partial F-test would then show that they do not improve the fit because the variability in mortality that they explain is already accounted for by other variables. You could test this possibility by regressing each pollution variable against the non-pollution variables in the model. Second, the pollution variables may not have been significantly different from 0 at the alpha level of the F-test (0.05), but still explained enough variability in mortality to be retained under the less stringent (higher) alpha levels effectively used by the variable selection procedures (whose overall alpha level we don't actually know, given the multiple tests involved).
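The first check suggested above (regress each pollution variable on the non-pollution predictors and look at R^2) can be sketched as follows. The data here are synthetic stand-ins for illustration, not the homework data set; a high R^2 would indicate that a pollution variable is collinear with the other predictors.

```python
# Checking whether a "pollution" variable is collinear with the other
# predictors: regress it on them and inspect R^2. Synthetic data only.
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1 - resid @ resid / ss_tot

rng = np.random.default_rng(2)
n = 60
nonpoll = rng.normal(size=(n, 5))              # stand-ins for PREC, JANT, ...
hc = nonpoll[:, 0] + 0.2 * rng.normal(size=n)  # built to be collinear
so = rng.normal(size=n)                        # built to be independent

print(round(r_squared(hc, nonpoll), 2))   # high: hc is collinear
print(round(r_squared(so, nonpoll), 2))   # low: so is not
```

If the real HC, NOX, or SO showed a high R^2 against the non-pollution variables, that would support the first explanation: their contribution to the model is already absorbed by the other predictors.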