Wednesday, March 26 Lab Agenda

ENV255/STA242

Wednesday, March 26 Lab Agenda

Chapter 11 and WA3 Splus Tools:

Read in data for WA3 and make a pairs plot to get started.
Ensure that you can complete all tasks (loading scripts, interpreting output) from last week’s lab on measures of case influence.
Create a partial residual plot using the method outlined in WA3 and interpret.

Chapter 12 Splus Tools:

Use of the "leaps" command in model selection.
Calculation of AIC for models in a "model space."

Chapter 11 and WA3 Splus Tools

Notes on Part 3.

We will use partial residual plots to examine the relationship between SAT and expenditure after accounting for (log) percentage of takers and class rank. This is part of the initial exploration of the SAT data explained on pages 348-349 of Sleuth.

The preliminary model under consideration at this stage is: SAT~log(takers)+class.rank+expenditures. We know we can assess the significance of expenditures by looking at a p-value, but a graphical analysis is important if we are characterizing the role of expenditures on SAT scores adjusted for log(takers) and class rank. See Section 11.5.2 and 12.2.3.

There are a number of ways to create partial residual plots. For WA3, we'll focus on the first method below . Two other methods are also mentioned.

Fit the preliminary model so that you have a coefficient estimate of "expenditures".

Method 1. Required for Writing Assignment 3.

Fit the model "SAT~log(takers)+class rank" (SAT scores adjusted for takers and class rank). Save the residuals.
Now fit the model of "expenditures~log(takers)+class rank" (expenditures adjusted for takers and class rank). Save the residuals.

Here are the Splus commands to create the plot:

attach(satscores)

lm.partres.sat_lm(sat~log.takers+rank)

lm.partres.expend_lm(expend~log.takers+rank)

plot(lm.partres.sat$residuals,lm.partres.expend$residuals) #a plot of this type is in the WA3 paper.

Now regress the residuals of lm.partres.sat on lm.partres.expend and look at the slope.

partres.regr_lm(lm.partres.sat$residuals~lm.partres.expend$residuals)

print(summary(partres.regr))

The slope here should be the same one as when you fit the preliminary model. While the slope is statistically significant, the magnitude of the slope seems to be affected by particular observations. Which ones are these?

diagplot.fun(lm.partres.expend$residuals,lm.partres.sat$residuals)

Observation number 29 seems to have high leverage and influence (as measured by the absolute value of externally studentized residuals, Cooks Distance, and DFFITS (DFFITS to be explained in class)). This is Alaska. The same 6 Splus commands above can be run with and without Alaska to determine the effect of including Alaska on the slope, and to make a decision on whether to leave Alaska in the model. See the argument at the top of page 350.

Another use of partial residual plots is to determine if the variable of interest, in this case, expenditure, needs to be modeled in a non-linear way. Check residual plots to decide if non-linearity is justified:

plot(partres.regr$residuals,partres.regr$fitted)

qqnorm(partres.regr$residuals)

Neither of these indicate a non-linear function of expenditures needs in the regression model with log(takers) and rank needs further consideration, but these do confirm that we should deal with Alaska.

A note on the use of influence diagnostics: Above I demonstrate the use of influence diagnostics in the case of a simple linear regression (SLR). You may not need to use this kind of machinery in the SLR case since Alaska clearly an influential observation based on the partial residual plot. However, remember that these diagnostic plots are very useful in multivariate regression, when you are trying to find an outlier when the X space (the number of X variables) has more than 1 dimension, as it does in the SLR.

Method 2. following directions on page 325 (middle of page)

prelimod_lm(sat~log.takers+rank+expend)

partresids_prelimod$residuals+prelimod$coefficients[4]*expend

plot(expend,partresids)

You'll note that the plot is similar in shape, and the results found in Method 1 are the same. This plot is the same as the one in the book on page 349.

Method 3. Splus windows. Under "Regression" fit the full model of interest. Under the "Results" tab, click "Partial Residual Plots" "Common Y scale" and "Include Partial Fit". Splus will use standardized residuals to create a partial residual plot for each variable, so your figure will not match Display 12.5, but your findings should be the same.

Chapter 12 Splus Tools

Read the help file for the leaps command by typing "help(leaps)" in the command line. This command will perform model selection from all possible subsets of a large model that you define, and with some simple computer code, you can rank models by AIC and BIC.
Note we have the following variables in the SAT dataset:

> names(satscores)

[1] "state" "sat" "takers" "income" "years"

[6] "public" "expend" "rank"

We decide that the model space is the set of all submodels of the largest model, the one with all main effects. See Display 12.6 for an illustration of the model space; in particular, look at the bottom half. First we will search based on adjusted R2, then by AIC & BIC. Remember, these automated search methods are no substitute for careful thought about the appropriateness of particular models and variables of interest.

X.mat_data.frame(log.inc=log(income), years, public, expend, rank, log.take=log(takers))

sat.mod.search.adjr2_leaps(X.mat,sat, method="adjr2")

print(sat.mod.search.adjr2)

To organize the output more efficiently, these commands order the data according to adjusted r2 and then print the X variables with their corresponding adjusted r2.

ord_order(sat.mod.search.adjr2$adjr2)

data.frame(sat.mod.search.adjr2$label[ord],sat.mod.search.adjr2$adjr2[ord])

Copy the following function into a new script file:

select.aic.bic.fun_function(x,y){

## this program works for a search of models with main effects and no interactions

n_length(y)

r2info_leaps(x,y,method="r2")

sstot_crossprod(y-mean(y))

ssres_(1-r2info$r2/100)*sstot

p_r2info$size

sigma2hat_ssres/(n-p)

AIC_n*log(sigma2hat)+2*p

BIC_n*log(sigma2hat)+p*log(n)

ord_order(AIC)

output_data.frame(Model=r2info$label[ord],AIC, BIC)

return(output)

}

Now Run from the command line:

select.aic.bic.fun(X.mat,sat)

What are the top 3 models selected by AIC? BIC? Adjusted R2?