STA242/ENV255
Agenda for Lab, Wednesday 2/25/03
By now you should have read Chapter 9 of Sleuth. We will focus our discussion today on Case Study 9.1.1, the Meadowfoam Flowering Example (p. 236). As you review for lab, please focus on the Case Study as well as Section 9.3.
All students should be able to get through at least 1 and 2 by the end of the lab if they are prepared. Some of you will be able to go through all items during the lab.
How do you define
an indicator, or "dummy", variable? How is it different from a continuous
variable?
Recall
the nematode data (Y=growth of plant, X=number of nematodes, in
thousands). We modeled the relationship between Y and X in two ways,
using ANOVA and using linear regression. Now let's consider only 2
levels of a nematode treatment: 0 and 10000 nematodes. Here
is a plot of the two fitted models when we consider only 2 nematode
groups.
What is the difference between the fitted values of growth in the two analyses? There is no difference in this two-sample case. In the regression in the second plot, yhat equals beta0 when we have 0 nematodes, and yhat equals beta0 plus 10*beta1 when we have 10000 nematodes (X is in thousands, so X=10). This is the same as fitting two means, one to the 0 nematode group and one to the 10000 nematode group.
Similarly, I could code the 0 nematode group as "0" and the 10000 nematode group as "1". Under the regression of Y on CODE, yhat equals beta0 when we have 0 nematodes, and yhat equals beta0 plus beta1 when we have 10000 nematodes. Of course, the value of beta1 would be different in this setting (10 times the value of beta1 above), but I would get the same fitted values as above.
So in this simple example, we see that treating nematodes as a 0-1 variable in regression simply adds a fixed amount (beta1) to the intercept when CODE=1.
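The equivalence described above can be checked numerically. The following is a minimal sketch in plain Python (not the lab software), using made-up growth values rather than the nematode data from class; the helper `fit_line` and all variable names are hypothetical.

```python
# Toy data, invented for illustration only (NOT the class nematode data).
growth_0 = [10.0, 9.0, 13.0, 8.0]   # plants with 0 nematodes
growth_10 = [6.0, 5.0, 4.0, 5.0]    # plants with 10000 nematodes


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


y = growth_0 + growth_10
x_thousands = [0] * len(growth_0) + [10] * len(growth_10)  # X in thousands
code = [0] * len(growth_0) + [1] * len(growth_10)          # 0-1 indicator

b0_x, b1_x = fit_line(x_thousands, y)  # regression of Y on X
b0_c, b1_c = fit_line(code, y)         # regression of Y on CODE

mean_0 = sum(growth_0) / len(growth_0)
mean_10 = sum(growth_10) / len(growth_10)

print("group means:          ", mean_0, mean_10)
print("fitted values, Y on X:   ", b0_x, b0_x + 10 * b1_x)
print("fitted values, Y on CODE:", b0_c, b0_c + b1_c)
print("slope ratio (CODE / X):  ", b1_c / b1_x)
```

Both regressions reproduce the two group means as fitted values, and the CODE slope is 10 times the X slope, because one unit of CODE corresponds to 10 units of X.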
Now move on to the
Meadowfoam
Data. Variables are:
Explanatory:
Light Intensity (continuous)
Explanatory:
Timing (Late=1 and Early=2). Late corresponds to the "at PFI" group and Early to the "before PFI" group referred to below.
Response:
Average number of flowers per plant.
Recode the Timing
variable so that Late=0 and Early=1. You can do this by hand, or
look at the functions under "Data".
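If you want to see the recoding spelled out, here is a small sketch in plain Python (not the lab software). The `timing` values are a hypothetical short column, and the name `early` for the recoded variable is an assumption made for illustration.

```python
# Hypothetical Timing column using the original coding (Late=1, Early=2).
timing = [1, 1, 1, 2, 2, 2]

# Recode so that Late=0 and Early=1.
recode = {1: 0, 2: 1}
early = [recode[t] for t in timing]

print(early)
```

Because the original codes are 1 and 2, subtracting 1 from Timing gives the same 0-1 indicator.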
The following should be a review, so focus on (a), (b) and (d). Do the others only if there is time at the end.
- Regress flowers on intensity.
- Produce a coded scatterplot with regression line superimposed. Compare your plot to the plot at the bottom of Display 9.8, "Equal lines model".
- Check the residuals (qqnorm, residuals versus fitted) to ensure that regression assumptions are met.
- Interpret the slope. Note that this model implies that the same
relationship between intensity and flowers holds for both timing
levels.
- Increasing light intensity has what effect on the mean
number of flowers per plant? Give a CI.
- Use the
centering trick to investigate the mean number of flowers at a light
intensity of 500, by fitting the model: flowers~I(intensity-500).
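To see why the centering trick works, note that the intercept of the centered model is the fitted mean at intensity 500, so its standard error in the regression output is the standard error of that estimated mean. A minimal sketch in plain Python follows; the data are invented for illustration (not the Meadowfoam data) and `fit_line` is a hypothetical helper.

```python
# Toy data, invented for illustration only (NOT the Meadowfoam data).
intensity = [150, 300, 450, 600, 750, 900]
flowers = [70.0, 62.0, 57.0, 50.0, 46.0, 39.0]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Uncentered model: flowers ~ intensity
b0, b1 = fit_line(intensity, flowers)
fitted_at_500 = b0 + b1 * 500

# Centered model: flowers ~ I(intensity - 500)
centered = [x - 500 for x in intensity]
c0, c1 = fit_line(centered, flowers)

print("fitted mean at 500 (uncentered model):", fitted_at_500)
print("intercept of centered model:          ", c0)
print("slopes (uncentered, centered):        ", b1, c1)
```

The slope is unchanged by centering; only the meaning of the intercept changes.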
New material.
- Now regress flowers on
intensity and timing.
(Click qqnorm plot option, residuals vs. fitted option.)
- What is
the slope for the "at PFI" group?
- What is the slope for
the "before PFI" group?
- What about intercepts for "at
PFI" and "before PFI"?
- Use
your regression output to give the regression equation for the "at
PFI" group and for the "before PFI" group.
- Produce a coded scatterplot with regression lines superimposed.
- Compare your plot to the plot in the middle of Display 9.8, "Parallel lines model".
- Check the residuals (qqnorm, residuals versus fitted).
- Note that this model implies that the effect of "timing" is to shift the regression line for the "before PFI" group up by a fixed amount. What is this amount? How would you get a confidence interval for this amount? Is the shift statistically significant (test of the coefficient for timing)? That is, is the parallel lines model preferable to the equal lines model?
- Looking at the F-statistic and p-value at the bottom of the
"Parameter Estimates" output, what are the null and
alternative hypotheses?
- Use the centering trick to investigate the
mean number of flowers at a light intensity of 500, by fitting the
model: flowers~I(intensity-500)+timing.
Note that the value will depend on the level of the timing variable.
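As a rough illustration of how the parallel lines model turns one set of coefficients into two regression equations, here is a sketch in plain Python (not the lab software). The data are invented (not the Meadowfoam data), the indicator is assumed to be coded Late=0 and Early=1, and the helpers `solve` and `least_squares` are hypothetical.

```python
# Toy data, invented for illustration only (NOT the Meadowfoam data).
intensity = [150, 300, 450, 600, 750, 900] * 2
early = [0] * 6 + [1] * 6   # assumed coding: Late=0, Early=1
flowers = [66.0, 58.0, 54.0, 45.0, 40.0, 36.0,
           78.0, 71.0, 63.0, 59.0, 51.0, 47.0]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Model: flowers ~ intensity + timing (columns: intercept, intensity, early)
rows = [[1.0, x, e] for x, e in zip(intensity, early)]
b0, b1, b2 = least_squares(rows, flowers)
fitted = [b0 + b1 * x + b2 * e for x, e in zip(intensity, early)]

print("late  (early=0): flowers =", b0, "+", b1, "* intensity")
print("early (early=1): flowers =", b0 + b2, "+", b1, "* intensity")
print("mean at intensity 500, late: ", b0 + 500 * b1)
print("mean at intensity 500, early:", b0 + b2 + 500 * b1)
```

The two lines share the slope b1 and differ only in intercept, by b2. The last two lines show why the centered fit gives a value that depends on the level of timing.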
New material.
- Now fit this regression model: flowers~intensity*timing. Note that this notation is equivalent to fitting the model: flowers~intensity+timing+intensity:timing
This model allows the effect of light intensity on the number of flowers to depend on whether the plant is "at PFI" or "before PFI". That is, this model includes an interaction between intensity and timing.
- Read Section 9.3.4. Use your regression output to give the regression equation for the "at PFI" group and for the "before PFI" group. Note that the 2 regression lines will differ by both slope and intercept; that is, the linear relationships between flowers and intensity differ according to timing.
- Produce a coded scatterplot with regression lines superimposed.
- Compare your plot to the plot at the top of Display 9.8, "Separate lines model".
- Test the significance of the interaction term to determine if the parallel lines or separate lines model is more appropriate.
The parallel lines and separate lines models are different from an analysis where 2 separate regression lines are fit. If I fit 2 separate regression lines, I would be estimating two error variances, one for each timing group. If I fit the parallel lines or separate lines models, I am using all of the data to estimate a single error variance.
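The distinction can be made concrete with a small sketch in plain Python (not the lab software). The data are invented (not the Meadowfoam data) and the helpers are hypothetical. The separate lines model gives the same two fitted lines as two separate regressions, so the sketch reuses those lines and only contrasts the variance estimates.

```python
# Toy data, invented for illustration only (NOT the Meadowfoam data).
x = [150, 300, 450, 600, 750, 900]
late_y = [66.0, 58.0, 54.0, 45.0, 40.0, 36.0]
early_y = [78.0, 71.0, 63.0, 59.0, 51.0, 47.0]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def rss(x, y, b0, b1):
    """Residual sum of squares around the line b0 + b1*x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


rss_late = rss(x, late_y, *fit_line(x, late_y))
rss_early = rss(x, early_y, *fit_line(x, early_y))

# Two separate regressions: two variance estimates, each on n_group - 2 df.
s2_late = rss_late / (len(x) - 2)
s2_early = rss_early / (len(x) - 2)

# Separate lines model: one pooled estimate on n - 4 df
# (four coefficients: two intercepts and two slopes).
s2_pooled = (rss_late + rss_early) / (2 * len(x) - 4)

print("separate fits:", s2_late, s2_early)
print("pooled:       ", s2_pooled)
```

With equal group sizes, as here, the pooled estimate is the average of the two separate estimates, but it rests on more degrees of freedom.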
- Note that in Chapter 10, Exercise 19, a lack-of-fit test is performed on the Meadowfoam data. Read through this exercise to make sure you understand how this is done. Answer: F = 0.437. What distribution is this compared to?