HW 5

STA 242/ENV 255: March 25, 1998

Variable Selection

Assignment: Due Tuesday, March 31

This week we will explore various methods of variable selection in SAS. In multiple regression, if we leave important variables out of the model, we end up with biased parameter estimates, t tests, the works! (Look back to Chapter 4, Table 4.1 in RWG.) To try to avoid this, we often measure many, many covariates that are potentially related to Y. However, some of these variables may be unrelated to Y, in which case including them in our regression model may result in larger SEs, wider confidence and prediction intervals, and poor predictions. Variable selection methods were developed to assist researchers in finding some "best" subset of variables.

With nonexperimental data sets (i.e. observational data) the independent variables are not "truly independent", in that they are usually correlated with one another, sometimes leading to problems with multicollinearity. In this situation, the regression coefficients are greatly affected by the particular subset of independent variables in the model. While automatic variable selection procedures may appear to be an easy way to find the "best model", they may solve the statistical problem without leading to a model that makes much sense on substantive grounds. While we will cover these methods, please use them with caution, and realize that they may miss interesting models that are plausible from the scientific perspective. If you can use scientific knowledge in model building, you should do so!

The problem we will consider involves the relationship between mortality and various pollution indices. (Source: McDonald, G.C. and Schwing, R.C. (1973), 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol. 15, 463-482.)

Variables in order:

PREC Average annual precipitation in inches
JANT Average January temperature in degrees F
JULT Same for July
OVR65 % of 1960 SMSA population aged 65 or older
POPN Average household size
EDUC Median school years completed by those over 22
HOUS % of housing units which are sound & with all facilities
DENS Population per sq. mile in urbanized areas, 1960
NONW % non-white population in urbanized areas, 1960
WWDRK % employed in white collar occupations
POOR % of families with income < $3000
HC Relative hydrocarbon pollution potential
NOX Same for nitric oxides
SO2 Same for sulphur dioxide
HUMID Annual average % relative humidity at 1pm
MORT Total age-adjusted mortality rate per 100,000

MORT is the response variable. We are particularly interested in whether mortality is related to the pollution variables HC, NOX, and SO2, after adjusting for the other variables. Here are the data for 60 US cities.
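If you want the data available outside SAS INSIGHT, a DATA step along the following lines is one way to read it. This is only a sketch: the file name 'pollution.dat' and the data set name sasuser.pollute are assumptions, and it assumes a plain text file with blank-separated values in the variable order listed above.

```sas
/* Sketch only: 'pollution.dat' and sasuser.pollute are assumed names. */
/* Values are read with list input in the order of the variable list.  */
DATA sasuser.pollute;
  INFILE 'pollution.dat';
  INPUT PREC JANT JULT OVR65 POPN EDUC HOUS DENS
        NONW WWDRK POOR HC NOX SO2 HUMID MORT;
RUN;

/* Check that 60 cities were read and that the ranges look sensible */
PROC MEANS DATA=sasuser.pollute N MEAN STD MIN MAX;
RUN;
```

If PROC MEANS reports fewer than 60 observations for some variable, check the raw file for missing or misaligned values before fitting any models.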

NOTE: Summarize your output and only turn in what you need to support your answer. The graders will not read through pages of output to search for your answer.

Assignment:

In SAS INSIGHT, fit the Full regression model with MORT as Y and all the other variables as predictors.
a) Examine the "usual" residual plots and discuss whether the assumptions for OLS seem reasonable. If you need to transform MORT, do so.
b) Next, examine the usual case statistics and see whether there are influential cases or outliers. (Remove any cases if necessary.)
c) Interpret the significance and meaning of the coefficients for the 3 pollution variables from the multiple regression. Discuss whether there are problems of multicollinearity and how this affects your interpretation.
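The assignment asks for SAS INSIGHT here, but if you want to cross-check the fit with PROC REG, a sketch like the one below should give the same coefficients together with variance inflation factors and case statistics. The data set name sasuser.pollute is an assumption; use whatever name your copy of the data has.

```sas
/* Sketch only: sasuser.pollute is an assumed data set name. */
PROC REG DATA=sasuser.pollute;
  Full: MODEL MORT = PREC JANT JULT OVR65 POPN EDUC HOUS DENS
                     NONW WWDRK POOR HC NOX SO2 HUMID
        / VIF R INFLUENCE;   /* VIF: multicollinearity; R, INFLUENCE: case statistics */
  PLOT STUDENT.*PREDICTED.;  /* studentized residuals against fitted values */
  PLOT RESIDUAL.*NQQ.;       /* normal quantile plot of the residuals */
RUN;
QUIT;
```

Large variance inflation factors for HC and NOX would be one sign that the pollution coefficients cannot be interpreted separately.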
Starting with the Full model with all predictors, test the hypothesis that the coefficients for NOX, HC and SO2 are all 0. What conclusion would you make?
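One way to carry out this joint test outside SAS INSIGHT is the TEST statement in PROC REG, sketched below. The data set name sasuser.pollute and the label Pollution are assumptions chosen for illustration.

```sas
/* Sketch only: sasuser.pollute is an assumed data set name. */
PROC REG DATA=sasuser.pollute;
  MODEL MORT = PREC JANT JULT OVR65 POPN EDUC HOUS DENS
               NONW WWDRK POOR HC NOX SO2 HUMID;
  /* Joint F test that all three pollution coefficients are zero */
  Pollution: TEST HC = 0, NOX = 0, SO2 = 0;
RUN;
QUIT;
```

The F statistic has 3 numerator degrees of freedom. If all 60 cities are kept, the denominator has 60 - 16 = 44 degrees of freedom; removing cases or transforming MORT in the earlier steps changes the fit accordingly.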
Starting with the Full model from the first question, delete the variable with the largest p-value. Refit the model, and repeat until all remaining variables are significant at your selected alpha value. Interpret the significance and meaning of the coefficients for the pollution variables. Discuss whether there are problems of multicollinearity and how this affects your interpretation. Exit from SAS INSIGHT.
Here are the generic SAS commands for using the various variable selection procedures:
```
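/* Replace sasuser.mydata, Y, and x1-x7 with your own data set and variables */
PROC REG DATA=sasuser.mydata;
 /* each MODEL statement runs one selection method */
 model Y = x1 x2 x3 x4 x5 x6 x7 / selection=FORWARD slentry=.1;
 model Y = x1 x2 x3 x4 x5 x6 x7 / selection=STEPWISE slentry=.1 slstay=.05;
 model Y = x1 x2 x3 x4 x5 x6 x7 / selection=BACKWARD slstay=.1;
 model Y = x1 x2 x3 x4 x5 x6 x7 / selection=MAXR;
 /* the keyword for adjusted R-square selection is ADJRSQ */
 model Y = x1 x2 x3 x4 x5 x6 x7 / selection=ADJRSQ;
RUN;
QUIT;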
```
Backward selection is similar to what you did above. In forward selection, we start with the variable most correlated with Y, and then at each step add the variable that produces the largest |t-statistic| (equivalently, the largest increase in R2). The procedure stops when the next variable to be added is not significant. Stepwise is similar, but allows variables that are already in the model to leave if they are no longer significant after the next variable is added. MAXR looks for the models with 1, 2, 3, etc. variables that have the highest R-square. ADJRSQ finds the models with the highest adjusted R-square. The option slentry=.1 sets the significance level a variable must meet to "enter" the model. slstay=.1 means that a variable has to be significant at the alpha = .1 level to "stay" in the model. You can change these to other values.
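For reference, the adjusted R-square used by the last method can be written as follows, for a model with an intercept and p predictors fit to n cases. The symbols n and p are notation introduced here, not taken from the assignment.

```latex
R^2_{\mathrm{adj}} \;=\; 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Unlike R-square, which can never decrease when a variable is added, the adjusted version falls when a new variable does not reduce the residual sum of squares enough to offset the lost degree of freedom. With n = 60 and all p = 15 predictors, the multiplier (n - 1)/(n - p - 1) is 59/44, or about 1.34.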
For the pollution data, use these variable selection methods to find a "best" subset of variables. Do they all find the same "best" model? Compare the models and what they imply about pollution and mortality. Do the results from your hypothesis test on the three pollution coefficients agree with your model selection results? What could be going on here?
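Applied to the pollution data, the generic commands might look like the sketch below. The data set name sasuser.pollute, the macro variable preds, and the model labels are assumptions made for illustration. If you transformed MORT or removed cases earlier, use the transformed response and the reduced data set instead.

```sas
/* Sketch only: sasuser.pollute, preds, and the model labels are assumed names. */
%LET preds = PREC JANT JULT OVR65 POPN EDUC HOUS DENS
             NONW WWDRK POOR HC NOX SO2 HUMID;

PROC REG DATA=sasuser.pollute;
  Fwd:  MODEL MORT = &preds / SELECTION=FORWARD  SLENTRY=.1;
  Step: MODEL MORT = &preds / SELECTION=STEPWISE SLENTRY=.1 SLSTAY=.05;
  Back: MODEL MORT = &preds / SELECTION=BACKWARD SLSTAY=.1;
  MaxR: MODEL MORT = &preds / SELECTION=MAXR;
  /* BEST=5 limits the listing to the five subsets with the highest adjusted R-square */
  Adj:  MODEL MORT = &preds / SELECTION=ADJRSQ BEST=5;
RUN;
QUIT;
```

Labeling each MODEL statement makes it easier to match the output to the method when you compare which pollution variables each one keeps.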
Based on your analyses above, write a summary of one page or less, at the level of a government policy analyst, describing your findings and suggestions for an appropriate model or models. This should stand on its own; the reader should not have to sift through pages and pages of output to understand your results. Any numbers that are important should be included in the text!