STA 242/ENV 255: March 24-26, 1998
Variable Selection
The problem we will consider involves the relationship between
mortality and various pollution indices (Source: McDonald, G.C. and
Schwing, R.C. (1973) 'Instabilities of regression estimates relating
air pollution to mortality', Technometrics, vol.15, 463-482.)
We are particularly interested in whether
mortality is related to the pollution variables HC, NOX, and SO2,
after adjusting for the other variables.
Variables in order:
- PREC Average annual precipitation in inches
- JANT Average January temperature in degrees F
- JULT Same for July
- OVR65 % of 1960 SMSA population aged 65 or older
- POPN Average household size
- EDUC Median school years completed by those over 22
- HOUS % of housing units which are sound & with all facilities
- DENS Population per sq. mile in urbanized areas, 1960
- NONW % non-white population in urbanized areas, 1960
- WWDRK % employed in white collar occupations
- POOR % of families with income < $3000
- HC Relative hydrocarbon pollution potential
- NOX Same for nitric oxides
- SO2 Same for sulphur dioxide
- HUMID Annual average % relative humidity at 1pm
- MORT Total age-adjusted mortality rate per 100,000
MORT is the response variable.
The data for 60 US cities are available via hyperlink.
This week we will explore various methods of variable selection.
In multiple regression, if we leave out important variables from
the model, then we end up with biased parameter estimates, biased t
tests, the works! (look back to Table 4.1 in Chapter 4 of RWG). To try
to avoid this, we often measure many, many covariates that are
potentially related to Y. However, some of these variables may be
unrelated to Y, in which case including them in our regression model
may result in larger SE's, wider CIs/PIs, and poor predictions.
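The omitted-variable problem can be seen in a small pure-Python simulation (not the mortality data; all coefficients and variable names here are made up for illustration): y depends on two correlated predictors, and leaving x2 out biases the estimated slope on x1.

```python
import random

random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * x1[i] + random.gauss(0, 0.6) for i in range(n)]  # correlated with x1
y = [2.0 * x1[i] + 3.0 * x2[i] + random.gauss(0, 1) for i in range(n)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Simple regression of y on x1 alone (x2 omitted from the model):
b_omit = cov(x1, y) / cov(x1, x1)

# Two-predictor least squares slopes via the normal equations:
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
s1y, s2y = cov(x1, y), cov(x2, y)
det = s11 * s22 - s12 ** 2
b1_full = (s22 * s1y - s12 * s2y) / det
b2_full = (s11 * s2y - s12 * s1y) / det

print(round(b_omit, 2), round(b1_full, 2), round(b2_full, 2))
# b_omit is pulled toward 2 + 3*0.8 = 4.4, while the full model
# recovers slopes near the true values 2 and 3.
```

The biased estimate absorbs the effect of the correlated omitted variable, which is exactly why we are tempted to measure many covariates in the first place.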
Variable selection methods were developed to assist researchers in
finding some "best" subset of variables. With nonexperimental
(i.e. observational) data sets, the independent variables are not
"truly independent": they are usually correlated with one another,
sometimes leading to problems with multicollinearity. In this
situation, the regression coefficients are greatly affected by the
particular subset of independent variables in the model. While
automatic variable selection procedures may appear to be an easy way
to find the "best" model, they may end up solving the statistical
problem without necessarily leading to a model that makes sense on
substantive grounds. While we will cover these methods, please use
them with caution, and realize that they may miss models that are
interesting and plausible from the scientific perspective. If you can
use scientific knowledge in model building, you should do so!
REVIEW Problem: Test Ho: there is no (linear) relationship between
MORT and HC, NOX, and SO2 after adjusting for the other variables. Use
the output from the full regression and the regression under Ho.
- What is the regression model under Ho?
- What is the Ha?
- What is the test statistic?
- What is the distribution of the test statistic under Ho?
- Would you reject Ho at alpha = .05?
- What conclusion can you make?
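The arithmetic behind this review problem is the partial (extra sum of squares) F test. A Python sketch is below; the two SSE values are hypothetical placeholders, to be replaced with the error sums of squares from the full and reduced SAS output.

```python
n = 60          # US cities
p_full = 15     # predictors in the full model
q = 3           # predictors dropped under Ho: HC, NOX, SO2

sse_full = 53680.0      # hypothetical -- read from the full-model output
sse_reduced = 61980.0   # hypothetical -- read from the reduced-model output

df_num = q
df_den = n - p_full - 1                      # 60 - 15 - 1 = 44
F = ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)
print(round(F, 2))
# Under Ho, F follows an F(3, 44) distribution; the 0.05 critical
# value is roughly 2.82, so reject Ho when F exceeds it.
```

With these placeholder numbers F is about 2.27, which would fall short of the critical value; your conclusion should of course come from the actual SAS output.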
Variable selection procedures:
- BACKWARD
- FORWARD
- STEPWISE
- MAXR
- SAS COMMANDS:
PROC REG DATA=sasuser.mydata;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=BACKWARD;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=FORWARD;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=STEPWISE;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=MAXR;
RUN;
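To see what a procedure like FORWARD is doing under the hood, here is a pure-Python sketch of forward selection on simulated data (not the mortality data). For simplicity it admits, at each step, the candidate that most reduces the residual sum of squares; SAS's FORWARD actually enters variables by a significance-level criterion (SLENTRY), so this is an illustration of the idea, not of SAS itself.

```python
import random

def ols_sse(X, y):
    """Residual sum of squares from least squares of y on the columns of X
    (plus an intercept), via the normal equations and Gaussian elimination."""
    n, k = len(y), len(X[0]) + 1
    Z = [[1.0] + row for row in X]                       # add intercept column
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]                              # Z'Z
    b = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]  # Z'y
    for col in range(k):                                 # elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            m = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):                       # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return sum((y[i] - sum(beta[j] * Z[i][j] for j in range(k))) ** 2
               for i in range(n))

def forward_select(X_cols, y, n_steps):
    """Greedily add, at each step, the column giving the smallest SSE."""
    chosen, remaining = [], list(range(len(X_cols)))
    for _ in range(n_steps):
        best = min(remaining,
                   key=lambda j: ols_sse([[X_cols[m][i] for m in chosen + [j]]
                                          for i in range(len(y))], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

random.seed(1)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]          # pure noise variable
y = [3 * x1[i] + 1 * x2[i] + random.gauss(0, 1) for i in range(n)]
print(forward_select([x1, x2, x3], y, 2))            # x1 enters first, then x2
```

Note that the greedy path never revisits earlier decisions, which is why forward, backward, and stepwise runs on the same data can disagree.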
Let's look at the output from MAXR