STA 242/ENV 255: March 25, 1998
Variable Selection
Assignment: Due Tuesday, March 31
This week we will explore various methods of variable selection in
SAS. In multiple regression, if we leave out important variables from
the model, then we end up with biased parameter estimates, t tests,
the works! (Look back to Chapter 4, Table 4.1 in RWG.) To try to
avoid this, we often measure many, many covariates that are
potentially related to Y. However some of these variables may be
unrelated to Y, in which case including them in our regression model
may result in larger SE's and wider CI/PI, and poor predictions.
Variable selection methods were developed to assist researchers in
finding some "best" subset of variables. With nonexperimental data
sets (i.e. observational data) the independent variables are not
"truly independent", in that they are usually correlated with one
another, sometimes leading to problems with multicollinearity. In this
situation, the regression coefficients are greatly affected by the
particular subset of independent variables in the model. While
automatic variable selection procedures may appear to be an easy way
to find the "best model", they may solve the statistical problem
without leading to a model that makes much sense on substantive
grounds. While we will cover these methods, please use them with
caution, and realize that they may miss interesting plausible models
from the scientific perspective. If you can use scientific knowledge
in model building you should do so!
The problem we will consider involves the relationship between mortality and various pollution indices (Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol.15, 463-482.)
Variables in order:
- PREC Average annual precipitation in inches
- JANT Average January temperature in degrees F
- JULT Same for July
- OVR65 % of 1960 SMSA population aged 65 or older
- POPN Average household size
- EDUC Median school years completed by those over 22
- HOUS % of housing units which are sound & with all facilities
- DENS Population per sq. mile in urbanized areas, 1960
- NONW % non-white population in urbanized areas, 1960
- WWDRK % employed in white collar occupations
- POOR % of families with income < $3000
- HC Relative hydrocarbon pollution potential
- NOX Same for nitric oxides
- SO2 Same for sulphur dioxide
- HUMID Annual average % relative humidity at 1pm
- MORT Total age-adjusted mortality rate per 100,000
MORT is the response variable. We are particularly interested in whether
mortality is related to the pollution variables HC, NOX, and SO2,
after adjusting for the other variables. Here are the data for 60 US
cities.
NOTE: Summarize your output and only turn in what you need to support your answer. The graders will not read through pages of output to search for your answer.
Assignment:
- Fit the Full regression model with MORT as Y with all the other
independent variables in SAS INSIGHT. a) Examine the "usual" residual
plots and discuss if the assumptions for OLS seem reasonable. If you
need to transform MORT, do so. b) Next, examine the usual case
statistics and see if there are influential cases or outliers. (Remove any cases if necessary.)
c) Interpret the significance and meaning of the coefficients for the 3 pollution variables from the multiple regression. Discuss whether there are
problems of multicollinearity and how this affects your
interpretation.
- Starting with the Full model with all predictors, test the hypothesis that the coefficients for NOX, HC and SO2 are all 0.
What conclusion would you make?
- Starting with the model in 1), delete the variable with the
largest p-value. Refit the model, and repeat until all
variables are significant at your selected alpha value. Interpret the
significance and meaning of the coefficients for the pollution
variables. Discuss whether there are problems of multicollinearity
and how this affects your interpretation.
Exit from SAS INSIGHT.
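If you would like to check your SAS INSIGHT results for steps 1 and 2 from the program editor, something like the following might work. This is only a sketch: the data set name sasuser.pollution is hypothetical, and it assumes the variables are named exactly as in the list above. The VIF option prints variance inflation factors (one way to look at multicollinearity), R and INFLUENCE print residual and case statistics, and the TEST statement gives the joint F test for the three pollution coefficients.

```sas
/* Sketch only: sasuser.pollution is a hypothetical data set name, and the
   variable names are assumed to match the list in this handout. */
PROC REG DATA=sasuser.pollution;
   /* Full model. VIF: variance inflation factors.
      R and INFLUENCE: residual and case-influence statistics. */
   model MORT = PREC JANT JULT OVR65 POPN EDUC HOUS DENS NONW WWDRK POOR
                HC NOX SO2 HUMID / vif r influence;
   /* Joint F test that all three pollution coefficients are zero. */
   pollution: test HC = 0, NOX = 0, SO2 = 0;
RUN;
QUIT;
```

With all 60 cities and 15 predictors, the joint test has 3 numerator and 44 denominator degrees of freedom; the denominator changes if you remove cases. If you transformed MORT, use the transformed response in the model statement instead.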
- Here are the generic SAS commands for using the various variable selection procedures:
PROC REG DATA=sasuser.mydata;
/* each MODEL statement runs one selection method on the same data */
model Y = x1 x2 x3 x4 x5 x6 x7/ selection=FORWARD slentry=.1;
model Y = x1 x2 x3 x4 x5 x6 x7/ selection=STEPWISE slentry=.1 slstay=.05;
model Y = x1 x2 x3 x4 x5 x6 x7/ selection=BACKWARD slstay=.1;
model Y = x1 x2 x3 x4 x5 x6 x7/ selection=MAXR;
model Y = x1 x2 x3 x4 x5 x6 x7/ selection=ADJRSQ;
RUN;
Backward selection is similar to what you did above. In forward
selection, we start with the variable most correlated with Y, and then
at each step add the variable that produces the largest |t-statistic|
(equivalently, the largest increase in R-square). The procedure stops
when the best remaining variable is not significant. Stepwise is
similar, but allows variables that are already in the model to leave
if they are no longer significant after adding the next variable. MAXR
tries to find the models with 1, 2, 3, etc. variables that have the
highest R-square, although it is not guaranteed to find the best model
of each size. ADJRSQ finds the models with the highest adjusted
R-square. The option slentry=.1 sets the significance level a variable
must reach to "enter" the model. Similarly, slstay=.1 means that a
variable has to be significant at the alpha = .1 level to "stay" in
the model. You can change these to other values.
For the pollution data use these variable selection methods to find a "best" subset of variables. Do they all find the same "best" model? Compare the models and what they imply about pollution and mortality.
Do the results from your hypothesis test in 2 agree with your model selection results? What could be going on here?
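For the pollution data, the generic commands above might be specialized roughly as follows. Again this is a sketch under assumptions: sasuser.pollution is a hypothetical data set name, the macro variable xvars and the statement labels (fwd, step, back, maxr, adj) are made up for convenience, and the variable names are assumed to match the list in this handout. The BEST=5 option is added so that ADJRSQ prints only the five best models rather than every subset.

```sas
/* Sketch only: hypothetical data set name; xvars is a convenience macro
   variable holding the 15 candidate predictors. */
%let xvars = PREC JANT JULT OVR65 POPN EDUC HOUS DENS NONW WWDRK POOR
             HC NOX SO2 HUMID;

PROC REG DATA=sasuser.pollution;
   /* Labels identify each method in the output. */
   fwd:  model MORT = &xvars / selection=FORWARD  slentry=.1;
   step: model MORT = &xvars / selection=STEPWISE slentry=.1 slstay=.05;
   back: model MORT = &xvars / selection=BACKWARD slstay=.1;
   maxr: model MORT = &xvars / selection=MAXR;
   /* BEST=5 limits the printout to the five best subsets. */
   adj:  model MORT = &xvars / selection=ADJRSQ best=5;
RUN;
QUIT;
```

If you removed cases or transformed MORT in step 1, make the same changes here so that the selection results are comparable with your earlier fits.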
- Based on your analyses above, write a one-page (or less) summary
at the level of a government policy analyst describing your findings
and suggestions for an appropriate model(s). This should stand on its
own and the reader should not have to sift through pages and pages of
output to understand your results. Any numbers that are important
should be included in the text!