STA 242/ENV 255: March 24-26, 1998

Variable Selection

The problem we will consider involves the relationship between mortality and various pollution indices (Source: McDonald, G.C. and Schwing, R.C. (1973), 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, 15, 463-482).

We are particularly interested in whether mortality is related to the pollution variables HC, NOX, and SO2, after adjusting for the other variables.

Variables in order:

  1. PREC Average annual precipitation in inches
  2. JANT Average January temperature in degrees F
  3. JULT Same for July
  4. OVR65 % of 1960 SMSA population aged 65 or older
  5. POPN Average household size
  6. EDUC Median school years completed by those over 22
  7. HOUS % of housing units which are sound & with all facilities
  8. DENS Population per sq. mile in urbanized areas, 1960
  9. NONW % non-white population in urbanized areas, 1960
  10. WWDRK % employed in white collar occupations
  11. POOR % of families with income < $3000
  12. HC Relative hydrocarbon pollution potential
  13. NOX Same for nitric oxides
  14. SO2 Same for sulphur dioxide
  15. HUMID Annual average % relative humidity at 1pm
  16. MORT Total age-adjusted mortality rate per 100,000
MORT is the response variable. Here is the hyperlink to the data for the 60 US cities.
This week we will explore various methods of variable selection. In multiple regression, if we leave important variables out of the model, we end up with biased parameter estimates, invalid t tests, the works! (Look back at Table 4.1 in Chapter 4 of RWG.) To try to avoid this, we often measure many, many covariates that are potentially related to Y. However, some of these variables may be unrelated to Y, in which case including them in our regression model may result in larger SEs, wider CIs/PIs, and poor predictions.

Variable selection methods were developed to help researchers find some "best" subset of variables. With nonexperimental (i.e., observational) data sets, the independent variables are not "truly independent": they are usually correlated with one another, sometimes leading to problems with multicollinearity. In this situation, the regression coefficients are greatly affected by the particular subset of independent variables in the model.

While automatic variable selection procedures may appear to be an easy way to find the "best" model, they may solve the statistical problem without leading to a model that makes sense on substantive grounds. We will cover these methods, but please use them with caution, and realize that they may miss plausible models that are interesting from the scientific perspective. If you can use scientific knowledge in model building, you should do so!
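The omitted-variable bias mentioned above is easy to see in a small simulation. The sketch below (illustrative only, not the mortality data; all names and coefficient values are made up for the example) fits the same response with and without a correlated covariate: leaving out x2 biases the coefficient on x1.

```python
import numpy as np

# Simulate the omitted-variable bias described above.  Y depends on x1 and
# x2, and x1 and x2 are correlated; dropping x2 biases the x1 coefficient.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated covariates
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Full model: y ~ x1 + x2 (intercept omitted; all variables have mean 0)
X_full = np.column_stack([x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Reduced model: y ~ x1 only -- x1 absorbs part of x2's effect
X_red = x1[:, None]
b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)

print(b_full)   # close to the true values [2.0, 3.0]
print(b_red)    # biased: close to 2 + 3 * 0.8 = 4.4
```

The reduced-model coefficient lands near 2 + 3(0.8) = 4.4 because the slope of the omitted x2 on x1 is 0.8, so x1 picks up that share of x2's effect.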


REVIEW Problem: Test Ho: there is no (linear) relationship between MORT and HC, NOX, and SO2 after adjusting for the other variables. Use the output from the full regression and the regression fit under Ho.

  1. What is the regression model under Ho?
  2. What is the Ha?
  3. What is the test statistic?
  4. What is the distribution of the test statistic under Ho?
  5. Would you reject Ho at alpha = .05?
  6. What conclusion can you make?
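The test behind questions 3-5 is the partial (nested-model) F test. A sketch of the computation, using simulated data rather than the mortality data (the variable counts match the handout: n = 60 cities, 15 predictors in the full model, and q = 3 pollution variables dropped under Ho):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the mortality data: 60 observations, 15 predictors,
# with the response built from the first 12 (so Ho happens to be true here).
rng = np.random.default_rng(1)
n, p_full, q = 60, 15, 3
X = rng.normal(size=(n, p_full))
y = X[:, :p_full - q] @ rng.normal(size=p_full - q) + rng.normal(size=n)

def sse(Xmat, y):
    """Residual sum of squares from an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return resid @ resid

sse_full = sse(X, y)                   # all 15 predictors
sse_red = sse(X[:, :p_full - q], y)    # Ho: drop the 3 pollution terms

df_full = n - p_full - 1               # 60 - 15 - 1 = 44
F = ((sse_red - sse_full) / q) / (sse_full / df_full)
p_value = stats.f.sf(F, q, df_full)    # under Ho, F ~ F(3, 44)
print(F, p_value)
```

With the mortality data you would plug the SSEs from the SAS output for the full and reduced fits into the same formula, then compare F to the F(3, 44) distribution.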


Variable selection procedures:
  1. BACKWARD
  2. FORWARDS
  3. STEPWISE
  4. MAXR
  5. SAS COMMANDS:
    PROC REG DATA=sasuser.mydata;
     model Y = x1 x2 x3 x4 x5 x6 x7/ selection=BACKWARD;
     model Y = x1 x2 x3 x4 x5 x6 x7/ selection=FORWARD;
     model Y = x1 x2 x3 x4 x5 x6 x7/ selection=STEPWISE;
     model Y = x1 x2 x3 x4 x5 x6 x7/ selection=MAXR;
    RUN;
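
To see what a procedure like BACKWARD is doing under the hood, here is a rough sketch of backward elimination in Python (an assumption-laden stand-in, not SAS's exact algorithm: it drops, one at a time, the predictor with the largest t-test p-value until every remaining p-value is below a stay threshold, analogous to SAS's SLSTAY):

```python
import numpy as np
from scipy import stats

def backward_eliminate(X, y, names, slstay=0.10):
    """Drop the weakest predictor (largest p-value) until all stay."""
    keep = list(range(X.shape[1]))
    while keep:
        Xd = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ beta
        df = len(y) - Xd.shape[1]
        s2 = resid @ resid / df
        cov = s2 * np.linalg.inv(Xd.T @ Xd)
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])   # skip the intercept
        pvals = 2 * stats.t.sf(np.abs(t), df)
        worst = int(np.argmax(pvals))
        if pvals[worst] < slstay:                  # every term is significant
            break
        keep.pop(worst)                            # drop the weakest term
    return [names[j] for j in keep]

# Toy usage: y depends on x0 and x1 only; x2..x4 are pure noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)
selected = backward_eliminate(X, y, ["x0", "x1", "x2", "x3", "x4"])
print(selected)
```

FORWARD and STEPWISE work in the opposite direction, adding terms one at a time; MAXR instead searches for the model of each size with the largest R-squared.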
    
Let's look at the output from MAXR.