STA242/ENV255 Homework 3

Total points for this homework: 120 (#1,#4,#5) + 40 points extra credit (#2)

Note that you are responsible for all conceptual exercises at the end of Chapters 7-9.

[20 points] Write up solutions to each problem you missed on the midterm. Put your name on each page.
- For true/false questions, give a reason why the problem is true or false.
- If the answer is false, alter the statement so that it is true.
[40 points EXTRA CREDIT] We have two monitors for measuring indoor concentrations of carbon monoxide (CO). Each takes measurements every minute. Monitor A is a newer, more accurate and more expensive monitor. Monitor B is an older monitor which is cheaper to run. In a quality assurance experiment to verify that the monitors are measuring the same concentrations, both monitors are co-located near a CO source and are turned on simultaneously. The datafile, monitor.txt, gives CO concentrations (in ppm) measured by Monitor A (column 1) and Monitor B (column 2). Label the columns A and B.
1. Produce a regression line plot of Monitor B versus Monitor A. Do this using the command line:
  
  > attach(monitor)
  
  > par(pty="s")
  
  > plot(A,B, main="Comparison of New (A) and Old (B) \n CO Monitors",xlab="Monitor A (ppm CO)", ylab="Monitor B (ppm CO)",xlim=c(0,75),ylim=c(0,75))
  
  > abline(0,1,lty=1) # solid line: intercept 0, slope 1
  ## The data will follow the solid line above if the monitors are in complete agreement.
  > abline(lm(B~A),lty=2) # dashed line: fitted regression line
2. Regress Monitor B on Monitor A concentration measurements and report the fitted line.
3. Perform a test of whether the slope of the regression is equal to 1. Let alpha=0.05. Use the p-value to make your conclusion, and write a sentence in the format of a writeup summarizing your result.
4. Determine 95% confidence intervals for b₀ and b₁, such that the two confidence intervals simultaneously capture the slope and intercept of the regression line with 95% probability. Use the Bonferroni procedure to construct the intervals: use the Bonferroni t-multiplier on page 163 with k=2 (for 2 parameters) to construct your interval. What do these intervals say about how well the two monitors agree? That is, simultaneously, does the interval for the intercept include zero, and does the interval for the slope include 1?
5. We wish to predict a measurement from Monitor B when Monitor A reads 20 ppm. Give a 95% interval for Monitor B's measurement.
6. We wish to predict a measurement from Monitor A when Monitor B reads 20 ppm. Give a 95% interval for Monitor A's measurement.
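A note on parts 3 and 4: the formulas below are a sketch of the two calculations, assuming the multiplier on page 163 has the usual Bonferroni form. Here n (the number of paired measurements) and SE(b_j) (the standard error reported in the regression output) are notation introduced for this note. Remember that the t-statistic and p-value printed in the regression output test whether the slope is 0, not whether it is 1.

```latex
% Part 3: test of H0: beta_1 = 1 against Ha: beta_1 != 1
\[
  t = \frac{b_1 - 1}{\mathrm{SE}(b_1)},
  \qquad \text{referred to a } t \text{ distribution on } n - 2 \text{ degrees of freedom.}
\]

% Part 4: Bonferroni intervals for k = 2 parameters at overall level 95%
\[
  b_j \pm t_{n-2}\!\left(1 - \frac{0.05}{2k}\right)\mathrm{SE}(b_j)
  \;=\; b_j \pm t_{n-2}(0.9875)\,\mathrm{SE}(b_j),
  \qquad j = 0, 1 .
\]
```

Each interval is individually a 98.75% interval, which is what makes the pair hold simultaneously with probability at least 95%.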
Lack-of-Fit Test Practice Problems (not to be turned in)
1. Review Lab Problem #2, dealing with coral head density and distance from the shore. A solution to this problem was given out in which the lack of fit F-test was performed using the formula on page 219. Perform the lack of fit F-test by constructing a composite analysis of variance table as in Section 8.5.4 and in class and confirm that your answer is the same.
2. Work through practice problem 3 (Ecosystem Decay Data) in the master list of practice problems for Chapter 8. For 3(e), construct a composite analysis of variance table to perform the lack of fit test, and compare your answer to that given on the same web page.
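The arithmetic behind a composite analysis of variance table can be sketched as follows. This is a minimal illustration in Python rather than S-Plus, on made-up data with replicated X values; the function name `lack_of_fit_f` and the numbers are assumptions for this sketch and have nothing to do with the coral or ecosystem decay data. It compares the residual sum of squares from the straight-line fit with the residual sum of squares from the separate-means fit.

```python
from collections import defaultdict

def lack_of_fit_f(x, y):
    """Return (F, numerator df, denominator df) for the lack-of-fit F-test."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least-squares straight line and its residual sum of squares.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res_line = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df_line = n - 2

    # Separate-means model: one mean per distinct X value (pure error).
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_res_means = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        ss_res_means += sum((yi - group_mean) ** 2 for yi in ys)
    df_means = n - len(groups)

    # Extra sum of squares (lack of fit) over the pure-error mean square.
    extra_ss = ss_res_line - ss_res_means
    extra_df = df_line - df_means
    f_stat = (extra_ss / extra_df) / (ss_res_means / df_means)
    return f_stat, extra_df, df_means

# Made-up data: three X values, two replicates each.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 2, 4, 6, 7]
f_stat, df_num, df_den = lack_of_fit_f(x, y)
print(f_stat, df_num, df_den)
```

The p-value then comes from an F distribution with `df_num` and `df_den` degrees of freedom. A large p-value gives no evidence that the straight line fits worse than separate means.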
[70 points] Regression with Indicator Variables: Pollen removal data. These data are described in Chapter 3 of Sleuth, page 79 (Display 3.13), as well as page 262, problem 16. (Do not follow the exercises in the book.)
Data in pollen.txt. Variables: "removed", "duration", "code" (0=queens; 1=workers).
What you should learn from this problem: Familiarity with a logit transformation; interpreting the meaning of coefficients in this type of model. Walking through consideration of three types of models: single line, parallel lines, and non-parallel lines. Knowledge of how to plot these models will be useful to your projects. Understanding dynamics of how p-values change with addition/deletion of variables. Often a categorical variable (such as queen/worker status) can be a "lurking" variable in a regression problem; ignoring it can be misleading. Many regressions you will see this semester will contain both categorical and continuous variables.
The logit transformation. If p is the proportion, then the logit transform is log[p/(1-p)]. This is the log of the ratio of the amount of pollen removed to the amount not removed. We can refer to it as the "log of the pollen removal ratio", or the "logit of the proportion of pollen removed". Often this transformation allows us to make the needed assumptions for a regression model. All problems below should be turned in. We will consider the model, log(removed/(1-removed))~log(duration), in depth.
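To make the transformation concrete, here is a minimal sketch of the logit and its inverse. It is written in Python rather than S-Plus purely for illustration, and the proportion 0.25 is a made-up value, not one from pollen.txt. The function names `logit` and `inv_logit` are assumptions of this sketch.

```python
import math

def logit(p):
    """Log of the removal ratio p / (1 - p); p must lie strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Back-transform a logit value x to a proportion: exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1.0 + math.exp(x))

# Hypothetical proportion of pollen removed.
p = 0.25
x = logit(p)           # log(0.25 / 0.75) = log(1/3), a negative number
p_back = inv_logit(x)  # recovers the original proportion
print(x, p_back)
```

A proportion below 0.5 gives a negative logit, a proportion of exactly 0.5 gives 0, and a proportion above 0.5 gives a positive logit. The same inverse is what back-transforms the endpoints of a confidence interval from the logit scale to the proportion scale.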
1. Fit the following 3 models and put the report output on a separate page.
  - Model A: Fit the "Equal Lines" model log(removed/(1-removed))~log(duration)
  - Model B: Fit the "Parallel Lines" model log(removed/(1-removed))~log(duration)+code
  - Model C: Fit the "Separate Lines" model log(removed/(1-removed))~log(duration)*code
2. Based on the output for Model A, what evidence is there that the regression line is a significant improvement over an "equal means" model? (This is always your first step.) Write your result in a sentence as if you were doing a one-page writeup.
3. We will test the hypothesis:
  updated 2/13/04, 6pm:
  Ho: The model describing the linear relationship between log(duration) and the logit of proportion pollen removed is the same for both types of bees. (for transformed data we have same slope, same intercept.)
  Ha: While the linear relationship between log(duration) and the logit of proportion pollen removed is the same for the two types of bees (same slope), the mean logit of the proportion of pollen removed for workers at each level of log(duration) differs from that of queens. (different intercepts)
  Write out the model that corresponds to the null hypothesis, and the model that corresponds to the alternative hypothesis. Perform the test suggested by the hypotheses above using the model output for Model B. What coefficient are you testing? Give hypotheses in statistical notation (Greek), test statistic, p-value and conclusion. Which model is more appropriate?
  How do the hypotheses differ when we wish to test whether the logit of the proportion of pollen removed for workers exceeds that of queens, after accounting for the effect of duration? What is the p-value for this test?
  Is the p-value for the significance of log(duration) term different in Model A than in Model B? Why?
4. Now we will consider whether the effect of duration on proportion of pollen removed is different for workers and for queens. This involves consideration of an interaction term and investigation of its significance.
  Ho: While the linear relationship between log(duration) and the logit of the proportion of pollen removed is the same for the two types of bees (same slope), the logit of the proportion of pollen removed at each level of log(duration) differs between the two types of bees. (different intercepts)
  Ha: The linear relationship between log(duration) and the logit of the proportion of pollen removed is different for workers and for queens. (different slopes, different intercepts)
  Write out the models implied by the null and alternative hypotheses. Perform this test by comparing the model output for Models B and C. What coefficient are you testing? Give hypotheses in statistical notation (Greek), test statistic, p-value and conclusion. Which model is more appropriate? Why is the p-value for the significance of the indicator variable so different in Model B than in Model C, the model with the interaction term?
5. Draw a coded scatterplot with regression line superimposed for Model B. Put all plots on a separate page. You'll need to modify the code from lab to do this -- change the dataset name and the variable names.
6. Now we will focus on interpreting the parallel lines model.
  1. Interpret the intercept both on the transformed scale and on the original scale. Note that by interpreting the intercept, you are extrapolating.
  2. updated 2/13/04, 6pm
    Use Model B to write a sentence on the effect of a doubling in duration on the median removal ratio (p/(1-p)) for workers, along with its associated confidence interval. What about for queens?
    (This model is a log-log model, where X=duration and Y=(p/(1-p))=removal ratio, so the interpretation is the same as previous log-log models you have considered.)
    
    A note: Hypothesis tests do not tell the whole story; "effect sizes", given in terms of confidence intervals are more informative.
  3. Estimate the median proportion of pollen removed for a queen who visits the flower for 20 seconds and give a 95% CI for this amount. You will need to recenter the data by subtracting "log(20)" from each of the log transformed observations. That is, you will run the model log(removed/(1-removed))~I(log(duration)-log(20))+code. Use your results to form a confidence interval for the mean(log(p/(1-p))). To back transform the endpoints of this interval, you need to know (verify this) that if logit(p)=log(p/(1-p))=X, then p=exp(X)/(1+exp(X)).
7. What happens to the R² term as you move from Model A to Model C? Why?
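A note on the doubling interpretation in part 6: in a log-log model, doubling X adds slope × log(2) to the mean of log(Y), which multiplies the median of Y by 2 raised to the slope. The sketch below, in Python rather than S-Plus, applies the same back-transformation to the endpoints of a confidence interval for the slope. The function name `doubling_effect` and the slope and interval values are hypothetical, chosen only to show the arithmetic; they are not output from any of the pollen models.

```python
def doubling_effect(slope, ci_low, ci_high):
    """Multiplicative change in the median of Y when X doubles, for log(Y) ~ log(X).

    Returns the point estimate and the back-transformed interval endpoints.
    """
    return 2.0 ** slope, 2.0 ** ci_low, 2.0 ** ci_high

# Hypothetical slope estimate and 95% confidence limits on the log-log scale.
est, low, high = doubling_effect(0.5, 0.0, 1.0)
print(est, low, high)
```

With these made-up numbers, a doubling of X would be associated with a multiplicative change in the median of Y of about 1.41, with an interval from 1 to 2. In the homework, Y is the removal ratio p/(1-p) and X is duration.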
[30 points] Multiple regression with continuous X variables: Pace of life and heart disease. These data are described on pages 260-261 of Sleuth, problem 14, Display 9.14.
Data in EX0914.ASC. Variables: "bank", "walk", "talk", "heart".
Note that all variables are standardized: the mean is subtracted from each value and the result is divided by the standard deviation. This means that a one-unit change in X is a change of one sample standard deviation in X. The same goes for Y.
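As a small illustration of what standardizing does, the sketch below standardizes a made-up list of values using the sample standard deviation (n-1 denominator). It is in Python rather than S-Plus, and the function name `standardize` and the values are assumptions of this sketch, not the EX0914 data, which are already standardized.

```python
import statistics

def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return [(v - mean) / sd for v in values]

# Hypothetical raw measurements.
raw = [2.0, 4.0, 6.0, 8.0]
z = standardize(raw)
print(z)
```

After standardizing, the values have sample mean 0 and sample standard deviation 1, so a difference of 1 in `z` corresponds to one sample standard deviation on the original scale.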
The model statement you will enter in Splus is: heart~bank+walk+talk
- Follow the exercises in the book (a)-(d) in addition to the two below.
- (e) Make a scatterplot matrix for part (a). Put all plots on a separate page.
- (f) Give one-sentence interpretations for each regression parameter, using careful language. For example: holding "bank clerk speed (bank)" and "postal clerk talking speed (talk)" constant, what is the effect of a one sample standard deviation increase in pedestrian walking speed on mean death rate due to heart disease?

Last modified: Fri Feb 20 16:34:14 EST 2004