ENV255/STA242
Wednesday, March 26 Lab Agenda
Chapter 11 and WA3 Splus Tools:
Chapter 12 Splus Tools:
Chapter 11 and WA3 Splus Tools
Notes on Part 3.
We will use partial residual plots to examine the relationship between SAT and expenditure after accounting for (log) percentage of takers and class rank. This is part of the initial exploration of the SAT data explained on pages 348-349 of Sleuth.
The preliminary model under consideration at this stage is: SAT~log(takers)+class.rank+expenditures. We know we can assess the significance of expenditures by looking at a p-value, but a graphical analysis is important if we are characterizing the role of expenditures on SAT scores adjusted for log(takers) and class rank. See Section 11.5.2 and 12.2.3.
There are a number of ways to create partial residual plots. For WA3, we'll focus on the first method below . Two other methods are also mentioned.
Fit the preliminary model so that you have a coefficient estimate of "expenditures".
Method 1. Required for Writing Assignment 3.
Here are the Splus commands to create the plot:
attach(satscores)
lm.partres.sat_lm(sat~log.takers+rank)
lm.partres.expend_lm(expend~log.takers+rank)
plot(lm.partres.sat$residuals,lm.partres.expend$residuals) #a plot of this type is in the WA3 paper.
Now regress the residuals of lm.partres.sat on lm.partres.expend and look at the slope.
partres.regr_lm(lm.partres.sat$residuals~lm.partres.expend$residuals)
print(summary(partres.regr))
The slope here should be the same one as when you fit the preliminary model. While the slope is statistically significant, the magnitude of the slope seems to be affected by particular observations. Which ones are these?
diagplot.fun(lm.partres.expend$residuals,lm.partres.sat$residuals)
Observation number 29 seems to have high leverage and influence (as measured by the absolute value of externally studentized residuals, Cooks Distance, and DFFITS (DFFITS to be explained in class)). This is Alaska. The same 6 Splus commands above can be run with and without Alaska to determine the effect of including Alaska on the slope, and to make a decision on whether to leave Alaska in the model. See the argument at the top of page 350.
Another use of partial residual plots is to determine if the variable of interest, in this case, expenditure, needs to be modeled in a non-linear way. Check residual plots to decide if non-linearity is justified:
plot(partres.regr$residuals,partres.regr$fitted)
qqnorm(partres.regr$residuals)
Neither of these indicate a non-linear function of expenditures needs in the regression model with log(takers) and rank needs further consideration, but these do confirm that we should deal with Alaska.
A note on the use of influence diagnostics: Above I demonstrate the use of influence diagnostics in the case of a simple linear regression (SLR). You may not need to use this kind of machinery in the SLR case since Alaska clearly an influential observation based on the partial residual plot. However, remember that these diagnostic plots are very useful in multivariate regression, when you are trying to find an outlier when the X space (the number of X variables) has more than 1 dimension, as it does in the SLR.
Method 2. following directions on page 325 (middle of page)
prelimod_lm(sat~log.takers+rank+expend)
partresids_prelimod$residuals+prelimod$coefficients[4]*expend
plot(expend,partresids)
You'll note that the plot is similar in shape, and the results found in Method 1 are the same. This plot is the same as the one in the book on page 349.
Method 3. Splus windows. Under "Regression" fit the full model of interest. Under the "Results" tab, click "Partial Residual Plots" "Common Y scale" and "Include Partial Fit". Splus will use standardized residuals to create a partial residual plot for each variable, so your figure will not match Display 12.5, but your findings should be the same.
Chapter 12 Splus Tools
> names(satscores)
[1] "state" "sat" "takers" "income" "years"
[6] "public" "expend" "rank"
sat.mod.search.adjr2_leaps(X.mat,sat, method="adjr2")
To organize the output more efficiently, these commands order the data according to adjusted r2 and then print the X variables with their corresponding adjusted r2.
ord_order(sat.mod.search.adjr2$adjr2)
data.frame(sat.mod.search.adjr2$label[ord],sat.mod.search.adjr2$adjr2[ord])
select.aic.bic.fun_function(x,y){
## this program works for a search of models with main effects and no interactions
n_length(y)
r2info_leaps(x,y,method="r2")
sstot_crossprod(y-mean(y))
ssres_(1-r2info$r2/100)*sstot
p_r2info$size
sigma2hat_ssres/(n-p)
AIC_n*log(sigma2hat)+2*p
BIC_n*log(sigma2hat)+p*log(n)
ord_order(AIC)
output_data.frame(Model=r2info$label[ord],AIC, BIC)
return(output)
}
Now Run from the command line:
select.aic.bic.fun(X.mat,sat)
What are the top 3 models selected by AIC? BIC? Adjusted R2?