STA242/ENV255 Homework 3

Due Thursday, February 28 in class

Last modified: Thu Feb 28 16:46:34 EST 2002

This homework corresponds to Chapter 9 of Sleuth. You are responsible for Conceptual Exercises in Chapter 9.
To maximize your grade, read these homework guidelines.
Please put all plots on pages at the end of your homework.
Report all fitted regression lines in the format of the box above Section 7.4 (on page 180).

Regression with Indicator Variables: Pollen removal data. These data are described in Chapter 3 of Sleuth, pages 75-77 (Display 3.11), as well as page 252, problem 16. (Do not follow the exercises in the book.)
Data in pollen.txt. Variables: "removed", "duration", "code" (0=queens; 1=workers).
What you should learn from this problem: Familiarity with a logit transformation; interpreting the meaning of coefficients in this type of model. Walking through consideration of three types of models: single line, parallel lines, and non-parallel lines. Knowledge of how to plot these models will be useful to your projects. Understanding dynamics of how p-values change with addition/deletion of variables. Often a categorical variable (such as queen/worker status) can be a "lurking" variable in a regression problem; ignoring it can be misleading. Many regressions you will see this semester will contain both categorical and continuous variables.

These problems (1-4) are strongly recommended, but need not be turned in. You can work in groups on these; the important thing is to understand how the logit transformation changes the fit of the regression. Also, you should recognize that problems 1-4 represent a sequence of steps you might follow to choose a model.
1. Consider the model removed~duration. (Make residual vs. fitted plots as well as QQnormal plots, but don't turn these in.)
  1. Draw a coded scatterplot with regression line superimposed of proportion of pollen removed versus duration of visit.
  2. Write out the fitted model.
2. Consider the model log(removed/(1-removed))~duration. (Make residual vs. fitted plots as well as QQnormal plots, but don't turn these in.) For the case in which the response variable is a proportion (between 0 and 1), the logit transformation is useful. If p is the proportion, then the logit transform is log[p/(1-p)]. This is the log of the ratio of the amount of pollen removed to the amount not removed. We can refer to it as the "log of the pollen removal ratio", or the "logit of the proportion of pollen removed".
  1. Create the logit transform of proportion of pollen removed, and draw a coded scatterplot with regression line superimposed of the logit(proportion of pollen removed) versus duration.
  2. Write out the fitted model.
3. Consider the model log(removed/(1-removed))~log(duration). (Make residual vs. fitted plots as well as QQnormal plots, but don't turn these in.)
  1. Draw a coded scatterplot with regression line superimposed of the logit(proportion of pollen removed) versus log(duration).
  2. Write out the fitted model.
4. Review your scatterplots, as well as residual plots. Of the plots in problems 1, 2 and 3, choose the most appropriate transformation and explain why in a couple of sentences.
Now we will focus on the model in which the logit of p is a linear function of the log of duration. Recall that this means that p is a non-linear function of duration. The slope of the fitted line will indicate whether p is an increasing or decreasing non-linear function of duration. A final writeup of such a study would include information about the function fitted (plotted on original scale) as well as a qualitative description of the behavior of p as a function of duration and type of bee.
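As a small numerical illustration of the logit transform and its inverse, here is a minimal Python sketch (the course itself uses Splus; the function names `logit` and `inv_logit` and the example proportions are assumptions made for illustration, not part of the assignment or the data). It shows that equal steps on the logit scale do not correspond to equal steps in p.

```python
import math

def logit(p):
    """Log of the removal ratio p/(1-p); p must lie strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Back-transform a logit value x to a proportion in (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))

# Hypothetical proportions chosen for illustration; NOT values from pollen.txt.
for p in (0.1, 0.5, 0.9):
    x = logit(p)
    # Print the proportion, its logit, and the back-transformed proportion.
    print(p, round(x, 4), round(inv_logit(x), 4))
```

Because the back-transformation is monotone, an increasing line on the logit scale corresponds to an increasing (but curved) relationship on the original scale.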
All problems below should be turned in. We will consider the model log(removed/(1-removed))~log(duration) in depth.
1. Fit the following 3 models and put the report output on a separate page.
  - Model A: Fit the "Equal Lines" model log(removed/(1-removed))~log(duration)
  - Model B: Fit the "Parallel Lines" model log(removed/(1-removed))~log(duration)+code
  - Model C: Fit the "Separate Lines" model log(removed/(1-removed))~log(duration)*code
2. Based on the output for Model A, what evidence is there that the regression line is a significant improvement over an "equal means" model? (This is always your first step.)
3. We will test the hypothesis:
  Ho: The effect of duration on the fraction of pollen removed is the same for both types of bees. (same model for both types of bees; for transformed data we have same slope, same intercept)
  Ha: While the linear relationship between log(duration) and the logit of the proportion of pollen removed is the same for the two types of bees (same slope), the logit of the proportion of pollen removed for workers at each level of log(duration) is different from that of queens. (different intercepts)
  Write out the fitted model that corresponds to the null hypothesis, and the fitted model that corresponds to the alternative hypothesis. Perform the test suggested by the hypotheses above using the model output for Model B. What coefficient are you testing? Give hypotheses in statistical notation (Greek), test statistic, p-value and conclusion. Which model is more appropriate?
  How do the hypotheses differ when we wish to see whether the logit of the proportion of pollen removed for workers exceeds that of queens, after accounting for the effect of duration? What is the p-value for this test?
  Is the p-value for the significance of the log(duration) term different in Model A than in Model B? Why?
4. Now we will consider whether the effect of duration on proportion of pollen removed is different for workers and for queens. This involves consideration of an interaction term and investigation of its significance.
  Ho: While the linear relationship between log(duration) and the logit of the proportion of pollen removed is the same for the two types of bees (same slope), the logit of the proportion of pollen removed at each level of log(duration) differs between the two types of bees. (different intercepts)
  Ha: The linear relationship between log(duration) and the logit of the proportion of pollen removed is different for workers and for queens. (different slopes, different intercepts)
  Write out the fitted models implied by the null and alternative hypotheses. Perform this test by comparing the model output for Models B and C. What coefficient are you testing? Give hypotheses in statistical notation (Greek), test statistic, p-value and conclusion. Which model is more appropriate? Why is the p-value for the significance of the indicator variable so different in Model B than in Model C, the one with the interaction term?
5. Draw a coded scatterplot with regression line superimposed for Model B. Put all plots on a separate page.
6. Now we will focus on interpreting the parallel lines model.
  1. Interpret the intercept both on the transformed scale and on the original scale. Note that by interpreting the intercept, you are extrapolating.
  2. Use Model B to write a sentence on the effect of a doubling in duration on the median removal ratio (p/(1-p)) for workers, along with its associated confidence interval. First calculate a CI for the change in the mean of the logit(proportion of pollen removed). Then back-transform (just exponentiate) to put the interval in terms of the median removal ratio. What about for queens?
    
    A note: Hypothesis tests do not tell the whole story; "effect sizes", given in terms of confidence intervals are more informative.
  3. Estimate the median proportion of pollen removed for a queen who visits the flower for 20 seconds and give a 95% CI for this amount. You will need to recenter the data by subtracting "log(20)" from each of the log transformed observations. That is, you will run the model log(removed/(1-removed))~I(log(duration)-log(20))+code. Use your results to form a confidence interval for the mean(log(p/(1-p))). To back-transform the endpoints of this interval, you need to know (verify this) that if logit(p)=log(p/(1-p))=X, then p=exp(X)/(1+exp(X)).
7. What happens to the R^2 term as you move from Model A to Model C? Why?
- Optional but strongly suggested: Make a plot of the fitted model on the original scale of measurement. This can be accomplished using the command line. For the model considered in (1), we can write Y=proportion pollen removed as a function of X=duration as follows.
  p = ( exp(beta0) X^beta1 ) / ( 1 + exp(beta0) X^beta1 )
  See Splus directions
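The back-transformations used in this problem can be sketched numerically. The following is a minimal Python illustration, not the Splus directions referred to above. The function names and the coefficient values are hypothetical placeholders, not fitted estimates. It evaluates the fitted curve on the original scale and uses the fact that doubling the duration adds beta1*log(2) to the logit, so the removal ratio p/(1-p) is multiplied by 2^beta1.

```python
import math

def fitted_proportion(duration, beta0, beta1):
    """p = exp(beta0) * X^beta1 / (1 + exp(beta0) * X^beta1), for X = duration > 0."""
    ratio = math.exp(beta0) * duration ** beta1  # fitted removal ratio p/(1-p)
    return ratio / (1.0 + ratio)

def doubling_ratio(beta1):
    """Multiplicative change in the removal ratio p/(1-p) when duration doubles."""
    return 2.0 ** beta1

def doubling_ratio_ci(beta1_lower, beta1_upper):
    """Back-transform the endpoints of a CI for beta1 into a CI for the doubling effect."""
    return doubling_ratio(beta1_lower), doubling_ratio(beta1_upper)

# Placeholder coefficients for illustration only; NOT estimates from pollen.txt.
beta0, beta1 = -2.0, 0.8
for x in (5, 10, 20, 40):
    # Print each duration with the proportion the placeholder curve gives for it.
    print(x, round(fitted_proportion(x, beta0, beta1), 3))
```

To use this with your own fit, replace the placeholder coefficients with the estimates from your regression output, and pass the endpoints of your confidence interval for the slope to `doubling_ratio_ci`.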

Multiple regression with continuous X variables: Pace of life and heart disease. These data are described on page 251 of Sleuth, problem 14.

Data in EX0914.ASC. Variables: "bank", "walk", "talk", "heart".

The model statement you will enter in Splus is: heart~bank+walk+talk
Follow the exercises in the book in addition to the supplement below.
Note that all variables are standardized; that is, the mean is subtracted from each value and the result is divided by the standard deviation. This means that if we talk about a one-unit change in X, we are talking about a one sample standard deviation change in X. The same goes for Y.
Make a scatterplot matrix for part (a). (Splus won't let you put heart on the vertical axis, I don't think.) Put all plots on a separate page.
Additional part (e): Give one-sentence interpretations for each regression parameter, using careful language. Holding "bank clerk speed (bank)" and "postal clerk talking speed (talk)" constant, what is the effect of a one sample standard deviation increase in pedestrian walking speed on mean death rate due to heart disease?
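The standardization described in the note above can be illustrated with a short Python sketch (again, not Splus). The function name `standardize` and the raw values are assumptions made for illustration; the numbers are not taken from EX0914.ASC.

```python
import statistics

def standardize(values):
    """Subtract the sample mean, then divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Hypothetical raw measurements for illustration; NOT from EX0914.ASC.
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
z = standardize(raw)
# Print the standardized values, rounded for readability.
print([round(v, 3) for v in z])
```

After standardizing, a one-unit change in a variable is a change of one sample standard deviation, which is why the regression coefficients are read in standard deviation units.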