STA242/ENV255

Homework 6

Problem 1 Due: Friday, March 9th, 5pm (Put homework in Sandra's mailbox in 211 Old Chemistry Building.)

Problem 2 Due: Monday, March 19th, 5pm (Put homework in Sandra's mailbox in 211 Old Chemistry Building.)





  1. Refer to the article, "Postnatal growth and allometry of harbour porpoises from the Bay of Fundy," on E-reserve for our course under "McBride".


    Guide to reading the article:


    1. In equation [2] on page 123, the standard allometric model is given in terms of logs of the y and x variables. Give the implied equation for y as a function of x on the original scale of measurement. Hint: you'll need to start by exponentiating on both sides and simplifying using the basic rules of logs.

    2. In terms of the parameters of the model in [2], what does "exhibited negative allometry" mean?

    3. Consider performing an analysis on the set of 203 females only, using measurements of standard length and girth at eye (m11 in Table 1). Let x = standard length and y = girth at eye. We wish to test for evidence of negative allometry. Assume that Model [2] has been fit in order to answer this question.

      Provide:

      1. the relevant hypotheses;
      2. the value of the test statistic, calculated from the simple linear regression results for the females in the row of Table 4 corresponding to "m11";
      3. the rejection region for alpha=0.05;
      4. a one-sentence statement of the result of your test;
      5. the probability statement for the p-value in this case, together with the p-value itself. (An example of this for the pollen problem is given on page 2, line 10, of the HW4 solutions.)
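For the mechanics only, a one-sided slope test of H0: b = b0 can be sketched as below. The sketch is in Python for illustration (the course software is S-Plus), and the estimate, standard error, and critical value you substitute should come from the paper's tables and a t table; none of the values in the example are from the paper.

```python
def one_sided_slope_test(b_hat, se_b, t_crit, b0=1.0):
    """Sketch of a one-sided test of H0: beta = b0 vs Ha: beta < b0.

    b_hat, se_b: slope estimate and its standard error (read from a table).
    t_crit: the (negative) critical value from a t table at the chosen alpha.
    For a log-log allometric model relating two linear measurements, b0 = 1
    corresponds to isometry, so beta < 1 indicates negative allometry.
    All numeric inputs here are placeholders, not values from the paper.
    """
    t = (b_hat - b0) / se_b          # test statistic
    return t, t < t_crit             # (statistic, reject H0?)
```

For instance, a hypothetical estimate of 0.8 with standard error 0.05 gives t = -4.0, which falls in the rejection region for any usual alpha.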

    4. In Model [3], the authors describe a model applied to the log-transformed data (203 females and 198 males). Since their notation is somewhat confusing, we rewrite the model as:

      log(y)=a + b1 S + b2 log(L) + b3 S:(log(L))

      where

      y = a particular character measurement such as those given in Table 1

      L = standard length

      S is an indicator variable for the sex of the porpoise, with male=0 and female=1.

      1. For this model, give the number of degrees of freedom for your estimate of the model error.
      2. Describe how you would test for "homogeneity of slope between the sexes". Give hypotheses, form of test statistic (written as a formula; you don't need to substitute actual estimates) and rejection region for alpha=0.05.
      3. Does your answer to (2) also answer the question of whether the linear equation for males and females is the same? Yes or no and explain.

    5. Write a one-sentence interpretation of the coefficient of "log m14" in equation [5]. Make sure your sentence is understandable to the non-statistician. Hint: Consider the effect on mass of doubling m14 and remember that this is a multivariate regression.
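As a reminder of how coefficients in log-log models work in general (this is the generic fact behind the hint, not the paper's estimate): holding the other predictors fixed, doubling a predictor x multiplies the predicted response by 2^b, where b is the coefficient of log(x). A minimal Python sketch, with a hypothetical b:

```python
def multiplier_when_doubled(b):
    """In log(y) = ... + b*log(x) + ..., doubling x while holding the other
    predictors fixed multiplies the predicted y by 2**b.
    The coefficient b passed in is hypothetical, not the paper's estimate."""
    return 2.0 ** b
```

So b = 1 would mean doubling x doubles the prediction, while b between 0 and 1 means the prediction grows by a factor smaller than 2.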

    6. At the bottom of page 125, the authors make the following statements about "m6":

      1. Three straight line torso measurements (m6, rostrum to dorsal fin, ... ) were negatively allometric in both sexes.

      2. The growth rates were significantly different between the sexes for m6 ...; females exhibited a higher growth rate than males for ... m6....

      Assume the following model was fit:

      log(m6)=a + b1 S + b2 log(L) + b3 S:(log(L))

      Describe as precisely as you can what statements (1) and (2) say about the coefficients in the model above.



  2. The purpose of this homework problem is to learn how to construct case diagnostics, such as leverage, Cook's distance, and externally studentized residuals, in S-Plus. In addition, we will explore how transformations of variables may affect the case diagnostics. You will need to understand these concepts for the take-home midterm.

    Refer back to the Brain weight data from Case Study 9.2.

    1. Fit the regression of brain weight on body weight, gestation, and log litter size. Carefully write out the fitted model.

    2. Use a statistical test to determine whether gestation period and log(litter size) are associated with brain weight after accounting for body weight. (Hint: Think about what this question is asking for; understanding these types of questions is crucial for the class.)

    3. Obtain a set of case diagnostics for this model. Is any mammal influential in this fit?

    4. Refit the model without the influential observation, and obtain a new set of case diagnostics. Are there any influential observations from this fit?

    5. Repeat with all observations, but use log transformations of all variables. Are there any influential cases?

    6. What lessons about the connection between the need for transformations and influence can be discerned? Summarize your answers in one paragraph; turn in any output, calculations or graphs on a separate sheet (you may refer to them in the summary). All output should be clearly labeled with extraneous material removed. The write up should focus only on the case diagnostics and role of transformations. You DO NOT need to interpret the models.

    7. Identifying which mammals have larger brain weights than were predicted by the model might point the way to further variables that could be examined. Using the log-transformed model from above, examine the externally studentized residuals.

      Which mammals have substantially larger brain weights than were predicted by the model? Do any mammals have substantially smaller brain weights than were predicted by the model? Are there any mammals that would be considered significant outliers? Explain. Be sure to describe what hypothesis is being tested in the outlier test and how you reach your conclusion. Summarize in one paragraph, with any output, graphs, or calculations clearly labeled in an appendix.



    DON'T FORGET TO ANSWER the CONCEPTUAL EXERCISES FOR THE CHAPTER!



    Specific Instructions for Obtaining Output for Problem #2

    1. First download the data set, Case0902.asc, and read it into S-Plus. In the options tab for reading in a file, specify 5 in the field for the Name Column. This sets up that column for use in labeling points.

    2. Create log transformations of all variables (log.brain, log.body, log.gest, log.litter).

    3. To obtain the case diagnostics, we will need to fit the model using the command line mode, rather than through the menus. To fit the first model, enter the following command in the Command window:

      brain.lm <- lm(brain ~ gest + body + log.litter, qr=T, data=Case0902)

      This fits the linear model (lm) using the dataframe Case0902 and stores the results in the object "brain.lm". In S-Plus, "<-" is the assignment operator, telling S-Plus to store the output of a function or command in the left-hand-side variable. The qr=T option is required by the command in the next step that creates the case diagnostics. If you want to obtain the parameter estimates, etc., use the command summary(brain.lm).

    4. To obtain the case diagnostics, enter the next command in the command line window:

      brain.diag <- ls.diag(brain.lm)

      This will create an object brain.diag that contains leverage, Cook's distance and Externally Studentized residuals. To add these to the data frame, enter the commands:

      Case0902$leverage <- brain.diag$hat

      Case0902$cooks <- brain.diag$cooks

      Case0902$stud.res <- brain.diag$stud.res

      in the command window. These will now be available for plotting and identification of influential cases with the names leverage, cooks, and stud.res in the dataframe.

      To create a plot like the Cook's Distance plot in the regression plot output, go to the 2D graph menu and select High Density Line Plot. To label points, bring up the Graphics Tool Bar; this is the icon with two graphs, next to the triangle, circle, and square icon. Click on the Label Point button (the one that looks like an "A") in the top row of the Graph Tools palette. Clicking on points will label them with their row labels. To add more than one label, use shift-click.
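For intuition about what ls.diag computes, here is a rough sketch of the same three diagnostics for the simple-regression case, written in Python for illustration (S-Plus handles the general multiple-regression version via the hat matrix, so this is a conceptual sketch, not a substitute for ls.diag; the toy data in the usage example are made up):

```python
import math

def case_diagnostics(x, y):
    """Leverage, Cook's distance, and externally studentized residuals
    for a simple linear regression of y on x (illustrative sketch only)."""
    n, p = len(x), 2                      # p = number of parameters (intercept + slope)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    b0 = sum(y) / n - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # leverage: how far x_i sits from the bulk of the x values
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Cook's distance: combined effect of residual size and leverage
    cooks = [e * e * h / (p * mse * (1 - h) ** 2) for e, h in zip(resid, lev)]
    # externally studentized residual: residual scaled by the leave-one-out error variance
    stud = []
    for e, h in zip(resid, lev):
        s2_i = ((n - p) * mse - e * e / (1 - h)) / (n - p - 1)
        stud.append(e / math.sqrt(s2_i * (1 - h)))
    return lev, cooks, stud
```

For example, with x = [1, 2, 3, 4, 10] the point at x = 10 has by far the largest leverage, since it lies far from the mean of the x values; the leverages always sum to the number of parameters (here 2).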

    5. To refit the model deleting certain cases, say case 72, use the commands:

      brainsub.lm <- lm(brain ~ gest + body + log.litter, subset=c(-72), qr=T, data=Case0902)

      brainsub.diag <- ls.diag(brainsub.lm)

      The -72 means omit case 72. Other cases can be deleted by adding more negative numbers to the list, separated by commas; e.g., c(-2, -72, -75) omits cases 2, 72, and 75.

      Use the output for the subset model, brainsub.lm, to create diagnostics as in step 4 above. (Be sure to change the variable names used for storing the output in the dataframe, or else you will write over your previous results.) To create the plot without saving the results in the dataframe Case0902, you may enter the plot command directly in the command window. (Note: the labeling feature will only give you the row number, not the mammal name, this way.)

      plot(brainsub.diag$cooks, type="h", xlab="Index", ylab="Cook's Distance")

    6. Repeat using the model with all variables transformed using logs.

      logbrain.lm <- lm(log.brain ~ log.gest + log.body + log.litter, qr=T, data=Case0902)

      logbrain.diag <- ls.diag(logbrain.lm)

      plot(logbrain.diag$cooks, type="h", xlab="Index", ylab="Cook's Distance")

    7. Using the model with all variables transformed using logs, plot the externally studentized residuals to identify cases that are either over- or under-predicted by the model:

      plot(logbrain.diag$stud.res, xlab="Index", ylab="Externally Studentized Residual")

      Label the points. (Adding the variable stud.res to the dataframe will allow you to use the row labels with the mammal names to label points, which might look nicer; again, be careful to use a new name so that you don't write over the other variables - unless you want to!) To determine if there are outliers, the following plot of the p-values helps. The df in the pt function is the residual df minus 1; in the outlier model we need to estimate one additional parameter for that case, and hence lose one more df. Cases with p-values less than 0.05/n are considered outliers in the sense that they come from a population with a different mean than that defined by the multiple regression model. The abline function below adds a horizontal line (h=) at the cutoff.

      plot(2*(1-pt( abs(logbrain.diag$stud.res), 91)), xlab="Index", ylab="p-value for Studentized Residual")

      abline(h=.05/96)
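To see what the pt call above is doing, here is a rough Python sketch of the two-sided p-value 2*P(T >= |t|), computed by numerically integrating the t density (a sketch for intuition only; in practice S-Plus's pt function is the right tool, and the comparison against the Bonferroni cutoff 0.05/n mirrors the abline above):

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20000, span=40.0):
    """Two-sided p-value 2*P(T >= |t|), with the upper tail of the t density
    integrated numerically by Simpson's rule (illustrative sketch)."""
    a, b = abs(t), abs(t) + span
    h = (b - a) / steps
    s = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 2 * (s * h / 3)
```

Each studentized residual's p-value would then be compared against the Bonferroni cutoff 0.05/n (0.05/96 here), just as the abline at h=.05/96 does in the plot.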