Chapter 12: Model Selection

This lab focuses on the gender discrimination data, Case Study 12.02. The Sleuth authors first search for models (not including gender) that best describe log(salary in 1977). Once they have identified a few good models, they add the gender effect to determine its sign and significance in the context of those models.

Make sure you understand the commands you are typing below. Stop and ask the TAs if anything is unclear.

  1. Stepwise Model Selection

    For illustration, we will look at main effects in the gender discrimination data. Perform a sequential model selection using the "stepwise" procedure under the "Regression" menu. Your largest model would be log(sal77)~senior+age+educ+exper.

    attach(Case1202)
    stepwise(cbind(senior,age,educ,exper),log(sal77), f.crit=c(4,2),method="efroymson")

    The stepwise procedure adds the predictor that satisfies these three equivalent criteria:

    1. highest sample partial correlation, in absolute value, with the response, adjusting for the predictors already in the model (this is the criterion described in the documentation for "stepwise", and the one the procedure uses)
    2. adding the variable increases R2 more than adding any other single variable
    3. the added variable has the largest t or F statistic among the variables not already in the model.

    Note the f.crit argument above: this must be a numerical value (or a vector of two values) specifying the F value(s) to be used as criteria for adding or deleting variables to/from the subset when using Efroymson's method. If two values are provided, the first is the F-to-enter and the second is the F-to-delete.
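
    If the "stepwise" function is not available, a roughly comparable search can be run with R's step() function. This is only a sketch, not the lab's procedure: step() selects by AIC rather than by the F-to-enter/F-to-delete thresholds above, so its path and stopping point may differ, and it assumes the Case1202 data frame uses the lowercase variable names shown in this lab.

    ## sketch: AIC-based stepwise search (not identical to Efroymson's F-based method)
    null.fit <- lm(log(sal77) ~ 1, data=Case1202)                          ## start from the intercept-only model
    full.fit <- lm(log(sal77) ~ senior + age + educ + exper, data=Case1202)
    step(null.fit, scope=formula(full.fit), direction="both")              ## add/drop terms by AIC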

  2. Selection Among All Subsets

    Now we will look at methods for performing all-subsets selection on the gender discrimination data. As described in Section 12.6, we begin by examining salary on the log scale, and a "saturated second-order model" is chosen as the starting point.


    Data Preparation
    ## variables are "bsal" "sal77" "fsex" "senior" "age" "educ" "exper"

    ## desire a "rich" model with log transformed salary variable,
    ## squared terms interaction terms

    ## log transformed salary
    log.bsal <- log(bsal)

    ## squared variables
    senior2 <- senior^2
    age2 <- age^2
    educ2 <- educ^2
    exper2 <- exper^2

    ## interaction terms
    senior.i.age <- senior*age
    senior.i.educ <- senior*educ
    senior.i.exper <- senior*exper
    age.i.educ <- age*educ
    age.i.exper <- age*exper
    educ.i.exper <- educ*exper
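
    As a side note, the same rich model can be written directly as a single lm() formula instead of constructing new variables. This sketch assumes the variables attached above; I() protects the squared terms inside the formula.

    ## sketch: the saturated second-order model written as one formula
    rich.lm <- lm(log(bsal) ~ senior + age + educ + exper +
                  I(senior^2) + I(age^2) + I(educ^2) + I(exper^2) +
                  senior:age + senior:educ + senior:exper +
                  age:educ + age:exper + educ:exper)
    summary(rich.lm)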

    Model Selection Functions
    1. Model Selection Among Subsets: select.cp.aic.bic.s
      1. This function gives Cp, BIC, AIC, R2 and R2adj for a set of models.
      2. To run: select.cp.aic.bic.fun(cbind(x1,x2,x3,x4,x5),y)
      3. If you wish to "keep" some X variables in all regression models, specify these by typing: select.cp.aic.bic.fun(cbind(x1,x2,x3,x4,x5),y,keepit=1)
        The above command keeps the 1st variable, "x1", in all regressions. To keep more than one variable, you can type, for example, keepit=1:3 to keep variables 1, 2, and 3.
    2. Posterior Beliefs about Different Models: bicreg.s
      1. This function gives the following outputs for a set of models:
        • $postprob: posterior probabilities of the models selected
        • $label: labels identifying the models selected
        • $r2: R2 values for the models
        • $bic: values of bic for the models
        • $aic: values of aic for the models
        • $size: the number of independent variables in each of the models
        • $probne0: posterior probability that each variable is non-zero (in %)
        • $postmean: posterior mean of each coefficient (from model mixing)
        • $postsd: posterior standard deviation of each coefficient (from model mixing)
      2. To run: bicreg.fun(cbind(x1,x2,x3,x4,x5),y)
      3. To run while "keeping" terms 2 and 3 for example, bicreg.fun(cbind(x1,x2,x3,x4,x5),y,keepit=2:3)
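
    If the course .s files are not at hand, roughly similar output can be obtained from CRAN packages; the following is only a sketch, not the course functions themselves. regsubsets() in the leaps package reports Cp, BIC, and adjusted R2 (but not AIC), with force.in playing a role similar to keepit, and bicreg() in the BMA package returns most of the components listed above (bicreg.fun appears to be derived from it, but may differ, e.g. in the keepit argument and the $aic component).

    ## sketch: approximate stand-ins for the course functions
    library(leaps)   ## all-subsets regression summaries
    library(BMA)     ## BIC-based Bayesian model averaging

    X <- cbind(senior, age, educ, exper)

    all.sub <- regsubsets(X, log.bsal, nbest=3)   ## best 3 models of each size
    summary(all.sub)$cp                           ## Mallows' Cp
    summary(all.sub)$bic                          ## BIC
    summary(all.sub)$adjr2                        ## adjusted R2

    out <- bicreg(X, log.bsal)
    out$postprob   ## posterior probabilities of the selected models
    out$label      ## labels identifying those models
    out$probne0    ## posterior probability (in %) that each coefficient is non-zero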

    Steps for the Gender Discrimination Data
    1. First, run the two model selection functions on the set of models with four main effects or fewer.
      • select.cp.aic.bic.fun(cbind(senior, age, educ, exper), log.bsal)
        Which models have the lowest AIC, Cp, BIC? Give the top 3.
      • bicreg.fun(cbind(senior, age, educ, exper), log.bsal)
        Which models have the highest posterior probability?

        Looking at main effects only, the top models are log.bsal~senior+educ+exper, log.bsal~senior+educ, and log.bsal~senior+age+educ+exper. For the exploration of interactions and second-order effects, we'll keep all four main effects, keeping in mind that we may later want to examine whether the age and experience effects (together with their related interaction and squared terms) belong in the model.

    2. We now consider interactions and squared terms. Note that we'll keep the four main effects in the model while we search through models with interactions and squared terms. (Our goal in using the "keep" option is to avoid searching through models that contain higher-order terms but not their corresponding main effects.)

      • select.cp.aic.bic.fun(cbind(senior,age,educ,exper,senior2,age2,educ2,exper2,senior.i.age,senior.i.educ,senior.i.exper,age.i.educ,age.i.exper,educ.i.exper), log.bsal,keepit=1:4)
      • bicreg.fun(cbind(senior,age,educ,exper,senior2,age2,educ2,exper2,senior.i.age,senior.i.educ,senior.i.exper,age.i.educ,age.i.exper,educ.i.exper), log.bsal,keepit=1:4)
      • ## here we'll limit the amount of output produced
        ## assign bicreg output to an object and reference just part of it
        out.bayes <- bicreg.fun(cbind(senior,age,educ,exper,senior2,age2,educ2,exper2,senior.i.age,senior.i.educ,senior.i.exper,age.i.educ,age.i.exper,educ.i.exper), log.bsal,keepit=1:4)
        print(out.bayes$label[1:10]) ## look at just the first 10 models
        print(out.bayes$postprob[1:10]) ## and their posterior probabilities

    3. A possible model selection strategy at this point is to examine plausible models (ones that do not include interactions or squared terms without their corresponding main effects) built from these variables: the main effects senior, age, educ, and exper; the interactions age.i.educ, age.i.exper, and educ.i.exper; and possibly the quadratic term exper2.

    4. Skipping ahead to the question asked in the book, what is the effect of gender in two of the top models?

      print("*****output for SAEXCK*****")
      summary(lm(log(bsal)~age+educ+exper+senior+age*educ+age*exper+fsex))

      print("*****output for SAEXNCK*****")
      summary(lm(log(bsal)~age+educ+exper+senior+age*educ+age*exper+senior*educ+fsex))

    5. Compare the two models before presenting the final model. Use an ESS (extra-sum-of-squares) test to determine whether seniority matters. What is the significance of "senior" and "senior*educ"? (One way to set up this comparison is sketched below.)
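
      One way to set up this comparison, sketched under the assumption that the variables are still attached as above, is an extra-sum-of-squares F test via anova() on nested lm() fits: the first comparison tests the senior:educ term that distinguishes the two models in step 4, and the second drops the seniority terms entirely.

      ## sketch: ESS F tests for the seniority terms
      fit.saexck  <- lm(log(bsal) ~ age + educ + exper + senior + age:educ + age:exper + fsex)
      fit.saexnck <- lm(log(bsal) ~ age + educ + exper + senior + age:educ + age:exper + senior:educ + fsex)
      anova(fit.saexck, fit.saexnck)    ## F test for senior:educ

      fit.nosenior <- lm(log(bsal) ~ age + educ + exper + age:educ + age:exper + fsex)
      anova(fit.nosenior, fit.saexnck)  ## 2-df test for senior and senior:educ together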

Last modified: Tue Mar 23 18:09:58 EST 2004