STA 242/ENV 255
Lab Agenda
Use the computer to simulate 100 data points from a normal distribtuion with mean 0 and standard deviation 1. Store the results in a column called Y. Repeat this process 10 more times, storing results in X1, ..., X10. Note that Y will be totally unrelated to the explanatory variables. Use model selection among all subsets, based on Cp, BIC, and AIC. What model is suggested by each criterion? What danger (if any) is there in using a variable selection technique when the number of explanatory variables is a substantial proportion of the sample size?
How to do this in Splus: Go
to "File" and "New" and select "Data Set".
This is where you will store the simulated data.
To
create the Y variable, go to "Data" - "Random
Numbers". Under "Target Column" type in "Y".
"Sample Size" is 100 and "Distribution" is
"normal". "Mean" is 0 and "Standard
Deviation" is 1. Repeat this process in order to create "X1"
through "X10".
Here are some Splus commands to calculate Cp, AIC, and BIC:
For Cp, it's easy:
What does this give you? What about the Cp for the "NULL" model? Calculate it using the formula from the book. You'll need sigma2 for full and "NULL" model.
I'll provide another version of bicreg, with a "keep=" argument soon, for those of you who need it for your projects.
bicreg(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10))
Just to make the point once more, use the menu options to run stepwise to see what the "best" model is.
If this were your project, you would:
For the purpose of lab, make sure you know where the ingredients are to reproduce Tables 6-9. Specifically, for HW5, you'll reproduce a version of Table 8, without the column on "stepwise p-value". Save the ingredients so that you can make this table later. Also for HW5, save the top 10 posterior probabilities, their associated model sizes, and the labels for those models (that is, what variables are in them).
If you forget what objects are in "bicreg", assign your bicreg command a name (like "out" above), and type "names(out)". Then take a look at the documentation for the function to see what these are.
A note: As you compare your answers, your answers should be "in the ballpark" of the numbers in the paper. Just look at order of magnitude answers. The authors used a slightly different dataset than the one we'll be working with, so the numbers will differ, but the general conclusions should be the same.
out_leaps(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10),method="Cp")
Recall that the "NULL" model in the command line is: lm(Y~1). Which model does Cp select?
For BIC and AIC, you'll run the function bicreg.s. First load it as a scriptfile into Splus. Read the comments at the start of the function so you know what the inputs and outputs are.
### To manage output, you might try this:
out_bicreg(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10))
data.frame(out$label[1:20],out$bic[1:20],out$aic[1:20])