STA242/ENV255

Agenda for Lab, Wednesday 2/25/03

By now you should have read Chapter 9 of Sleuth. We will focus our discussion today on Case Study 9.1.1, the Meadowfoam Flowering Example (p. 236). As you review for lab, please focus on the Case Study as well as Section 9.3.

All students should be able to get through at least 1 and 2 by the end of the lab if they are prepared. Some of you will be able to go through all items during the lab.

How do you define an indicator, or "dummy", variable? How is it different from a continuous variable?

Recall the nematode data (Y=growth of plant, X=number of nematodes, in thousands). We modeled the relationship between Y and X in two ways, using ANOVA and using linear regression. Now let's consider only 2 levels of a nematode treatment: 0 and 10000 nematodes. Here is a plot of the two fitted models when we consider only 2 nematode groups.

What is the difference between the fitted values of growth in the two analyses? There is no difference in this two-sample case. In the regression in the second plot, yhat equals beta0 when we have 0 nematodes, and yhat equals beta0 plus 10*beta1 when we have 10000 nematodes (recall that X is in thousands). This is the same as fitting two means, one to the 0 nematode group and one to the 10000 nematode group.

Similarly, I could code the 0 nematode group as "0", and the 10000 nematode group as "1". Under the regression of Y on CODE, I would get that yhat equals beta0 when we have 0 nematodes, and yhat equals beta0 plus beta1 when we have 10000 nematodes. Of course, the value of beta1 would be different in this setting (10 times the value of beta1 above), but I would get the same fitted values as above.

So in this simple example, we see that treating nematodes as a 0-1 variable in regression simply shifts the intercept when CODE=1: the fitted mean is beta0 for the CODE=0 group and beta0 plus beta1 for the CODE=1 group.
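
The equivalence described above can be checked numerically. The following is a minimal sketch in Python (not the lab software), using made-up growth values rather than the actual nematode data. The helper names mean and simple_ols are hypothetical and defined here only for illustration.

```python
# Made-up growth values for illustration only; NOT the course's nematode data.
growth_0 = [10.8, 9.1, 13.5, 9.2]    # plants with 0 nematodes
growth_10 = [5.8, 5.3, 4.6, 7.2]     # plants with 10,000 nematodes


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


y = growth_0 + growth_10
x_thousands = [0] * len(growth_0) + [10] * len(growth_10)  # X in thousands
code = [0] * len(growth_0) + [1] * len(growth_10)          # 0-1 indicator

b0_x, b1_x = simple_ols(x_thousands, y)  # regression of Y on X
b0_c, b1_c = simple_ols(code, y)         # regression of Y on CODE

print("group means:       ", mean(growth_0), mean(growth_10))
print("fit on X (0, 10):  ", b0_x, b0_x + 10 * b1_x)
print("fit on CODE (0, 1):", b0_c, b0_c + b1_c)
print("slope ratio:       ", b1_c / b1_x)
```

Both regressions reproduce the two group means as fitted values, and the slope on CODE is 10 times the slope on X.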

Now move on to the Meadowfoam Data. Variables are:

Explanatory: Light Intensity (continuous)
Explanatory: Timing (Late=1, the "at PFI" group, and Early=2, the "before PFI" group).
Response: Average number of flowers per plant.

Recode the Timing variable so that Late=0 and Early=1. You can do this by hand, or look at the functions under "Data".
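
If you prefer to see the recoding written out, here is a small sketch in Python (not the lab software). The timing values below are a hypothetical column in the original 1/2 coding, not the actual data file, and the names early and recode are made up for this example.

```python
# Hypothetical Timing column in the original coding (Late=1, Early=2).
timing = [1, 1, 2, 2, 1, 2]

# Subtracting 1 maps Late (1) to 0 and Early (2) to 1.
early = [t - 1 for t in timing]

# An explicit lookup does the same thing and raises an error on an unexpected code.
recode = {1: 0, 2: 1}
early_lookup = [recode[t] for t in timing]

print(early)         # [0, 0, 1, 1, 0, 1]
print(early_lookup)  # same result
```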

The following should be a review, so focus on items 1, 2 and 4. Do the others only if there is time at the end.
1. Regress flowers on intensity.
2. Produce a coded scatterplot with regression line superimposed. Compare your plot to the plot at the bottom of Display 9.8, "Equal lines model".
3. Check the residuals (qqnorm, residuals versus fitted) to ensure that regression assumptions are met.
4. Interpret the slope. Note that this model implies that the same relationship between intensity and flowers holds for both timing levels.
5. What effect does increasing light intensity have on the mean number of flowers per plant? Give a confidence interval.
6. Use the centering trick to investigate the mean number of flowers at a light intensity of 500, by fitting the model: flowers~I(intensity-500).
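
To see why the centering trick in the last item works, here is a minimal sketch in Python (not the lab software) with made-up intensity and flower values, not the Meadowfoam data. The helper simple_ols is a hypothetical name defined only for this example.

```python
# Made-up (intensity, flowers) values for illustration only.
intensity = [150, 300, 450, 600, 750, 900]
flowers = [70.0, 62.0, 58.0, 50.0, 45.0, 36.0]


def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Uncentered fit: flowers ~ intensity
b0, b1 = simple_ols(intensity, flowers)

# Centered fit: flowers ~ I(intensity - 500)
centered = [xi - 500 for xi in intensity]
c0, c1 = simple_ols(centered, flowers)

print("slope, uncentered and centered:", b1, c1)
print("fitted mean at 500 from the uncentered fit:", b0 + 500 * b1)
print("intercept of the centered fit:", c0)
```

The slope is unchanged, and the intercept of the centered fit is the estimated mean at an intensity of 500. The practical payoff is that the standard error your software reports for that intercept is the standard error of the estimated mean at 500.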
New material.
1. Now regress flowers on intensity and timing. (Click qqnorm plot option, residuals vs. fitted option.)
2. What is the slope for the "at PFI" group?
3. What is the slope for the "before PFI" group?
4. What about intercepts for "at PFI" and "before PFI"?
5. Use your regression output to give the regression equation for the "at PFI" group and for the "before PFI" group.
6. Produce a coded scatterplot with regression lines superimposed.
7. Compare your plot to the plot at the middle of Display 9.8, "Parallel lines model".
8. Check the residuals (qqnorm, residuals versus fitted).
9. Note that this model implies that the effect of timing is to shift the regression line for the "before PFI" group up by a fixed amount. What is this amount? How would you get a confidence interval for this amount? Is the shift statistically significant (test of the coefficient for timing)? That is, is the parallel lines model preferable to the equal lines model?
10. Looking at the F-statistic and p-value at the bottom of the "Parameter Estimates" output, what are the null and alternative hypotheses?
11. Use the centering trick to investigate the mean number of flowers at a light intensity of 500, by fitting the model: flowers~I(intensity-500)+timing. Note that the value will depend on the level of the timing variable.
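
The structure of the parallel lines model in the items above can be sketched as follows, in Python (not the lab software). The data are made up for illustration and are not the Meadowfoam values. The helpers solve and ols are hypothetical names that fit least squares through the normal equations. The indicator early uses the recoding Late=0, Early=1.

```python
def solve(a, b):
    """Solve the square system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data for illustration only.
levels = [150, 300, 450, 600, 750, 900]
late_y = [62.0, 55.5, 50.1, 43.0, 38.4, 30.2]
early_y = [75.3, 68.0, 61.9, 56.4, 50.0, 42.7]

intensity = levels + levels
early = [0] * len(levels) + [1] * len(levels)  # Late=0, Early=1
flowers = late_y + early_y

# flowers ~ intensity + timing
b0, b1, b2 = ols([intensity, early], flowers)
print("at PFI (Late) line:       intercept", b0, "slope", b1)
print("before PFI (Early) line:  intercept", b0 + b2, "slope", b1)

# flowers ~ I(intensity - 500) + timing
centered = [x - 500 for x in intensity]
c0, c1, c2 = ols([centered, early], flowers)
print("estimated mean at 500, Late: ", c0)
print("estimated mean at 500, Early:", c0 + c2)
```

Both groups share the slope b1, and the timing coefficient b2 is the vertical distance between the two lines at every intensity, including 500.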
New material.
- The parallel lines and separate lines models are different from an analysis where 2 separate regression lines are fit. If I fit 2 separate regression lines, I would be estimating two model errors (one residual variance for each line). If I fit the parallel lines or separate lines models, I am using all of the data to estimate a single model error (one pooled residual variance).
- Note that in Chapter 10, Exercise 19, a lack-of-fit test is performed on the Meadowfoam data. Read through this exercise to make sure you understand how this is done. Answer: F = 0.437. What distribution is this compared to?
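
As a reminder of the mechanics, here is a generic sketch of a lack-of-fit F statistic in Python (not the lab software). It assumes a simple linear regression with replicate observations at each x value, and it uses made-up data. The model in the exercise may differ, so treat this only as an outline of the calculation. The helper names mean and simple_ols are hypothetical.

```python
# Made-up data with two replicates at each x value, for illustration only.
x = [150, 150, 300, 300, 450, 450, 600, 600]
y = [71.0, 67.5, 60.2, 63.1, 55.4, 52.0, 47.9, 50.3]


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Residual sum of squares from the regression line (reduced model).
b0, b1 = simple_ols(x, y)
ss_reg = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
df_reg = len(y) - 2

# Residual sum of squares from a separate mean at each x (full model).
groups = {}
for xi, yi in zip(x, y):
    groups.setdefault(xi, []).append(yi)
ss_pure = sum((yi - mean(g)) ** 2 for g in groups.values() for yi in g)
df_pure = len(y) - len(groups)

# Extra-sum-of-squares F statistic for lack of fit.
f_stat = ((ss_reg - ss_pure) / (df_reg - df_pure)) / (ss_pure / df_pure)
print("F statistic:", f_stat)
print("numerator df:", df_reg - df_pure, "denominator df:", df_pure)
```

The statistic compares the extra residual variation of the regression line, relative to separate group means, against the within-group (pure error) variation.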