STA 242/ENV 255: March 24-26, 1998
Variable Selection
The problem we will consider involves the relationship between
mortality and various pollution indices (Source: McDonald, G.C. and
Schwing, R.C. (1973) 'Instabilities of regression estimates relating
air pollution to mortality', Technometrics, vol.15, 463-482.)
We are particularly interested in whether
mortality is related to the pollution variables HC, NOX, and SO2,
after adjusting for the other variables.
Variables in order:
- PREC Average annual precipitation in inches
- JANT Average January temperature in degrees F
- JULT Same for July
- OVR65 % of 1960 SMSA population aged 65 or older
- POPN Average household size
- EDUC Median school years completed by those over 22
- HOUS % of housing units which are sound & with all facilities
- DENS Population per sq. mile in urbanized areas, 1960
- NONW % non-white population in urbanized areas, 1960
- WWDRK % employed in white collar occupations
- POOR % of families with income < $3000
- HC Relative hydrocarbon pollution potential
- NOX Same for nitric oxides
- SO2 Same for sulphur dioxide
- HUMID Annual average % relative humidity at 1pm
- MORT Total age-adjusted mortality rate per 100,000
MORT is the response variable.
The data for 60 US cities are available via hyperlink.
This week we will explore various methods of variable selection.
In multiple regression, if we leave out important variables from
the model, then we end up with biased parameter estimates, biased t
tests, the works! (look back to Table 4.1 in Chapter 4 of RWG). To try
to avoid this, we often measure many, many covariates that are
potentially related to Y. However, some of these variables may be
unrelated to Y, in which case including them in our regression model
may result in larger SE's, wider CIs/PIs, and poor predictions.
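The omitted-variable problem can be seen in a small pure-Python simulation (not the mortality data; all coefficients and variable names here are made up for illustration): y depends on two correlated predictors, and leaving x2 out biases the estimated slope on x1.

```python
import random

random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * x1[i] + random.gauss(0, 0.6) for i in range(n)]  # correlated with x1
y = [2.0 * x1[i] + 3.0 * x2[i] + random.gauss(0, 1) for i in range(n)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Simple regression of y on x1 alone (x2 omitted from the model):
b_omit = cov(x1, y) / cov(x1, x1)

# Two-predictor least squares slopes via the normal equations:
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
s1y, s2y = cov(x1, y), cov(x2, y)
det = s11 * s22 - s12 ** 2
b1_full = (s22 * s1y - s12 * s2y) / det
b2_full = (s11 * s2y - s12 * s1y) / det

print(round(b_omit, 2), round(b1_full, 2), round(b2_full, 2))
# b_omit is pulled toward 2 + 3*0.8 = 4.4, while the full model
# recovers slopes near the true values 2 and 3.
```

The biased estimate absorbs the effect of the correlated omitted variable, which is exactly why we are tempted to measure many covariates in the first place.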
Variable selection methods were developed to assist researchers in
finding some "best" subset of variables. With nonexperimental
(i.e. observational) data sets, the independent variables are not
"truly independent": they are usually correlated with one another,
sometimes leading to problems with multicollinearity. In this
situation, the regression coefficients are greatly affected by the
particular subset of independent variables in the model. While
automatic variable selection procedures may appear to be an easy way
to find the "best" model, they may end up solving the statistical
problem without necessarily leading to a model that makes sense on
substantive grounds. While we will cover these methods, please use
them with caution, and realize that they may miss models that are
interesting and plausible from the scientific perspective. If you can
use scientific knowledge in model building, you should do so!
REVIEW Problem: Test Ho: there is no (linear) relationship between
MORT and HC, NOX, and SO2 after adjusting for the other variables. Use
the output from the full regression and the regression under Ho.
- What is the regression model under Ho?
- What is the Ha?
- What is the test statistic?
- What is the distribution of the test statistic under Ho?
- Would you reject Ho at alpha = .05?
- What conclusion can you make?
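The arithmetic behind this review problem is the partial (extra sum of squares) F test. A Python sketch is below; the two SSE values are hypothetical placeholders, to be replaced with the error sums of squares from the full and reduced SAS output.

```python
n = 60          # US cities
p_full = 15     # predictors in the full model
q = 3           # predictors dropped under Ho: HC, NOX, SO2

sse_full = 53680.0      # hypothetical -- read from the full-model output
sse_reduced = 61980.0   # hypothetical -- read from the reduced-model output

df_num = q
df_den = n - p_full - 1                      # 60 - 15 - 1 = 44
F = ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)
print(round(F, 2))
# Under Ho, F follows an F(3, 44) distribution; the 0.05 critical
# value is roughly 2.82, so reject Ho when F exceeds it.
```

With these placeholder numbers F is about 2.27, which would fall short of the critical value; your conclusion should of course come from the actual SAS output.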
Variable selection procedures:
- BACKWARD
- FORWARD
- STEPWISE
- MAXR
- SAS COMMANDS:
PROC REG DATA=sasuser.mydata;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=BACKWARD;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=FORWARD;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=STEPWISE;
   model Y = x1 x2 x3 x4 x5 x6 x7 / selection=MAXR;
RUN;
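To see what a procedure like FORWARD is doing under the hood, here is a pure-Python sketch of forward selection on simulated data (not the mortality data). For simplicity it admits, at each step, the candidate that most reduces the residual sum of squares; SAS's FORWARD actually enters variables by a significance-level criterion (SLENTRY), so this is an illustration of the idea, not of SAS itself.

```python
import random

def ols_sse(X, y):
    """Residual sum of squares from least squares of y on the columns of X
    (plus an intercept), via the normal equations and Gaussian elimination."""
    n, k = len(y), len(X[0]) + 1
    Z = [[1.0] + row for row in X]                       # add intercept column
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]                              # Z'Z
    b = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]  # Z'y
    for col in range(k):                                 # elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            m = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):                       # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return sum((y[i] - sum(beta[j] * Z[i][j] for j in range(k))) ** 2
               for i in range(n))

def forward_select(X_cols, y, n_steps):
    """Greedily add, at each step, the column giving the smallest SSE."""
    chosen, remaining = [], list(range(len(X_cols)))
    for _ in range(n_steps):
        best = min(remaining,
                   key=lambda j: ols_sse([[X_cols[m][i] for m in chosen + [j]]
                                          for i in range(len(y))], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

random.seed(1)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]          # pure noise variable
y = [3 * x1[i] + 1 * x2[i] + random.gauss(0, 1) for i in range(n)]
print(forward_select([x1, x2, x3], y, 2))            # x1 enters first, then x2
```

Note that the greedy path never revisits earlier decisions, which is why forward, backward, and stepwise runs on the same data can disagree.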
Let's look at the output from MAXR