STA 242/ENV 255

Lab Agenda

These problems will be incorporated into HW5, so you should work through them. I hope to have HW5 finalized before class on Thursday.

Use the computer to simulate 100 data points from a normal distribtuion with mean 0 and standard deviation 1. Store the results in a column called Y. Repeat this process 10 more times, storing results in X1, ..., X10. Note that Y will be totally unrelated to the explanatory variables. Use model selection among all subsets, based on Cp, BIC, and AIC. What model is suggested by each criterion? What danger (if any) is there in using a variable selection technique when the number of explanatory variables is a substantial proportion of the sample size?
How to do this in Splus:
Go to "File" and "New" and select "Data Set". This is where you will store the simulated data.

To create the Y variable, go to "Data" - "Random Numbers". Under "Target Column" type in "Y". "Sample Size" is 100 and "Distribution" is "normal". "Mean" is 0 and "Standard Deviation" is 1. Repeat this process in order to create "X1" through "X10".

Here are some Splus commands to calculate Cp, AIC, and BIC:
For Cp, it's easy:
out_leaps(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10),method="Cp")

What does this give you? What about the Cp for the "NULL" model? Calculate it using the formula from the book. You'll need sigma2 for full and "NULL" model.
Recall that the "NULL" model in the command line is: lm(Y~1). Which model does Cp select?

For BIC and AIC, you'll run the function bicreg.s. First load it as a scriptfile into Splus. Read the comments at the start of the function so you know what the inputs and outputs are.
I'll provide another version of bicreg, with a "keep=" argument soon, for those of you who need it for your projects.

bicreg(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10))
### To manage output, you might try this:
out_bicreg(y=Y,x=cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10))
data.frame(out$label[1:20],out$bic[1:20],out$aic[1:20])

Just to make the point once more, use the menu options to run stepwise to see what the "best" model is.
Working through the results of the paper handed out in class. The body fat data are located here. You'll need to copy this to a text file, and omit the top part and the very bottom part in order to read it into Splus. You may want to make a header column with variable names.
If this were your project, you would:
- make pairs plots of variables, and calculate a correlation matrix. Which X variables appear related to Y, and which X's appear correlated with each other?
- Fit a tentative model and examine the effect of transformations on residual plots. The tentative model in this case could be the main effects model. For your tentative model, check outliers, residual plots, influential points.
- regress X's on other X's to determine the R2 and the tolerance (as discussed in class). Make decisions about highly correlated X's: leave in and lose ability to interpret separately or carefully remove variables that seem very correlated with others.
- consider an all subsets procedure, using the code given above.
- narrow down a model or two of interest, check outliers, residual plots, influential points.
- begin the process of answering research questions by interpretting coef's, and using the model to answer questions about the dynamics of the problem.
- Balance your time between model selection and interpreting your results. Spend time thinking about the model and whether it makes sense.
For the purpose of lab, make sure you know where the ingredients are to reproduce Tables 6-9. Specifically, for HW5, you'll reproduce a version of Table 8, without the column on "stepwise p-value". Save the ingredients so that you can make this table later. Also for HW5, save the top 10 posterior probabilities, their associated model sizes, and the labels for those models (that is, what variables are in them).
If you forget what objects are in "bicreg", assign your bicreg command a name (like "out" above), and type "names(out)". Then take a look at the documentation for the function to see what these are.
A note: As you compare your answers, your answers should be "in the ballpark" of the numbers in the paper. Just look at order of magnitude answers. The authors used a slightly different dataset than the one we'll be working with, so the numbers will differ, but the general conclusions should be the same.