STA 242
Homework and Lab Assignment 5
Case Diagnostics
The purpose of this lab is to learn how to construct case diagnostics, such as leverage, Cook's distance, and externally studentized residuals, in S-Plus. In addition, we will explore how transformations of variables may affect the case diagnostics.
Refer back to the brain weight data from Case Study 9.2. Fit the regression of brain weight on body weight, gestation, and log litter size. Obtain a set of case diagnostics for this model. Is any mammal influential in this fit? Refit the model without the influential observation, and obtain a new set of case diagnostics. Are there any influential observations in this fit? Repeat with all observations, but use log transformations of all variables. Are there any influential cases? What lessons about the connection between the need for transformations and influence can be discerned? Summarize your answers in one paragraph; turn in any output, calculations, or graphs on a separate sheet (you may refer to them in the summary). All output should be clearly labeled, with extraneous material removed. The write-up should focus only on the case diagnostics and the role of transformations. You DO NOT need to interpret the models.
Identifying which mammals have larger brain weights than were predicted by the model might point the way to further variables that could be examined. Using the log-transformed model from above, examine the externally studentized residuals. Which mammals have substantially larger brain weights than were predicted by the model? Do any mammals have substantially smaller brain weights than were predicted by the model? Are there any mammals that would be considered significant outliers? Explain. Be sure to describe what hypothesis is being tested in the outlier test and how you reach your conclusion. Summarize in one paragraph, with any output, graphs, or calculations clearly labeled in an appendix.
DON'T FORGET TO ANSWER the CONCEPTUAL EXERCISES FOR THE CHAPTER!
First download and read in the data set, Case0902.asc, into S-Plus. In the options tab for reading in a file, specify 5 in the field for the Name Column. This sets up this column for use in labeling points.
Create log transformations of all variables (log.brain, log.body, log.gest, log.litter)
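In S-Plus, these can be created with commands such as the following (this assumes the body-weight and litter-size columns in Case0902 are named body and litter, matching the model formula below; substitute your actual column names if they differ):

Case0902$log.brain <- log(Case0902$brain)
Case0902$log.body <- log(Case0902$body)
Case0902$log.gest <- log(Case0902$gest)
Case0902$log.litter <- log(Case0902$litter)

Note that log() is the natural logarithm; any base will do for the transformations, as long as you are consistent.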
To obtain the case diagnostics, we will need to fit the model using the command line mode, rather than through the menus. To fit the first model, enter the following command in the Command window:
brain.lm <- lm(brain ~ gest + body + log.litter, qr=T, data=Case0902)
This fits the linear model (lm) using the dataframe Case0902 and stores the results in the object "brain.lm". In S-Plus, "<-" is the assignment operator, telling S-Plus to store the output of a function or command in the left-hand-side variable. The qr=T option is necessary for the next command, which creates the case diagnostics. If you want to obtain the parameter estimates, etc., use the command summary(brain.lm).
To obtain the case diagnostics, enter the next command in the command line window:
brain.diag <- ls.diag(brain.lm)
This will create an object brain.diag that contains the leverage values, Cook's distances, and externally studentized residuals. To add these to the dataframe, enter the commands:
Case0902$leverage <- brain.diag$hat
Case0902$cooks <- brain.diag$cooks
Case0902$stud.res <- brain.diag$stud.res
in the command window. These will now be available for plotting and identification of influential cases with the names leverage, cooks, and stud.res in the dataframe.
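As a rough screen (a common rule of thumb, not a cutoff from the text), cases with Cook's distance near or above 1, or leverage above 2p/n (here p = 4 regression coefficients and n = 96 mammals), deserve a closer look. For example:

Case0902[Case0902$cooks > 1, ]
Case0902[Case0902$leverage > 2*4/96, ]

Each command prints the rows of the dataframe for the flagged cases, so you can see which mammals they are.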
To create a plot like the Cook's Distance plot in the regression plot output, go to the 2D graph menu and select High Density Line Plot. To label points, bring up the Graphics Tool Bar. This is the icon with two graphs, next to the triangle, circle and square icon. Click on the Label Point button in the top row of the Graph Tools Palette; the button that looks like an A. Clicking on points will label them with their row labels. To add more than one label, use a shift-click to add the label.
To refit the model deleting certain cases, say case 72, use the command,
brainsub.lm <- lm(brain ~ gest + body + log.litter, subset=c(-72), qr=T, data=Case0902)
brainsub.diag <- ls.diag(brainsub.lm)
The -72 means omit case 72. Other cases can be deleted by adding more numbers to the list, separated by commas; e.g., c(-2, -72, -75) omits cases 2, 72, and 75.
Use the output from the subset model, brainsub.lm, to create diagnostics as above. (Be sure to change the variable names used to store the output in the dataframe, or else you will overwrite your previous results.) To create the plot without saving the results in the dataframe Case0902, you may enter the plot command directly in the command window. (Note: the labeling feature will only give you the row number, not the mammal name, this way.)
plot(brainsub.diag$cooks, type="h", xlab="Index", ylab="Cook's Distance")
Repeat using the model with all variables transformed using logs.
logbrain.lm <- lm(log.brain ~ log.gest + log.body + log.litter, qr=T, data=Case0902)
logbrain.diag <- ls.diag(logbrain.lm)
plot(logbrain.diag$cooks, type="h", xlab="Index", ylab="Cook's Distance")
Using the model with all variables transformed using logs, plot the externally studentized residuals to identify cases that are either over- or under-predicted by the model:
plot(logbrain.diag$stud.res, xlab="Index", ylab="Externally Studentized Residual")
Label the points. (Adding the variable stud.res to the dataframe will allow you to use the row labels with the mammal names to label points, which might look nicer; again, be careful to use a new name so that you don't overwrite the other variables - unless you want to!) To determine whether there are outliers, the following plot of the p-values helps. The df in the pt function is the residual df minus 1; in the outlier model we must estimate one additional parameter for the case being tested, and hence lose one more df. Cases with p-values less than 0.05/n are considered outliers in the sense that they come from a population with a different mean than that specified by the multiple regression model. The abline function below adds a horizontal line (h=) at the cutoff.
plot(2*(1-pt(abs(logbrain.diag$stud.res), 91)), xlab="Index", ylab="p-value for Studentized Residual")
abline(h=.05/96)
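To list which cases fall below the Bonferroni cutoff, rather than reading them off the plot, you can index the dataframe directly. This sketch assumes stud.res has been added to Case0902 as described above:

p.vals <- 2*(1 - pt(abs(Case0902$stud.res), 91))
Case0902[p.vals < .05/96, ]

If no rows are printed, no case is a significant outlier at the Bonferroni-adjusted level.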