STA242/ENV255 Homework 4

Due either Thursday, March 7 in class
or by 5pm on Friday, March 8 to 223C Old Chemistry

Last modified: Fri Mar 1 14:03:51 EST 2002

This homework corresponds to Chapter 9 of Sleuth.
To maximize your grade, read these homework guidelines.
Report all fitted regression lines in the format of the box above Section 7.4 (on page 180).

The pollen removal data, continued. Assuming the parallel lines model is true, is there evidence that, after accounting for the amount of time on the flower, queens tend to remove a smaller proportion of pollen than workers? Perform a hypothesis test, giving test statistic and p-value. Give a confidence interval for the difference in the logit of the proportion of pollen removed.
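As a rough illustration of the test and interval asked for above, here is a minimal R sketch. It runs on simulated data, not the Sleuth pollen data. The variable names (removed, duration, queen), the 0/1 coding, and the use of log(duration) as the time variable are all assumptions made for this example; use whatever variables and transformations you settled on in the earlier homework.

```r
# Simulated stand-in for the pollen data; every value below is made up.
set.seed(1)
n <- 40
queen    <- rep(c(1, 0), each = n / 2)         # 1 = queen, 0 = worker (assumed coding)
duration <- runif(n, 2, 60)                    # time on the flower (simulated)
logit.p  <- -1 + 0.8 * log(duration) - 0.7 * queen + rnorm(n, sd = 0.5)
removed  <- exp(logit.p) / (1 + exp(logit.p))  # proportion of pollen removed

# Parallel lines model: common slope, intercept shifted by bee type.
y   <- log(removed / (1 - removed))            # logit of the proportion removed
fit <- lm(y ~ log(duration) + queen)

# Row for the queen indicator: estimate, SE, t statistic, two-sided p-value.
est <- coef(summary(fit))["queen", ]

# One-sided p-value for the alternative that queens remove less (coefficient < 0).
p.one <- pt(est["t value"], df = fit$df.residual)

# 95% confidence interval for the queen-minus-worker difference in logits.
ci <- confint(fit, "queen", level = 0.95)

print(est)
print(p.one)
print(ci)
```

The coefficient on queen is the vertical distance between the two parallel lines, so its t statistic and confidence interval answer the question directly.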
Some red spruce forests in the Appalachian Mountains show signs of decline, with many dead or dying trees. Environmental stress may contribute to this decline; deposition of airborne pollutants such as metals or acids tends to be heavier at higher elevations, where red spruce predominate. The dataset, spruce.txt, contains data on elevation and the percentage of dead or badly damaged trees, from 64 Appalachian sites (Johnson and Siccama, reported by the Committee on Monitoring and Assessment of Trends in Acid Deposition, 1986). Eight of the sites are in southern states (West Virginia, Virginia and North Carolina); the remainder are northern (New Hampshire, Vermont and New York).
Dataset: "spruce.txt" describing elevation and percentage dead or damaged red spruce trees
- Column 1: Location (0=South, 1=North)
- Column 2: Elevation (meters)
- Column 3: % Trees Damaged
You will write a 1-page summary of your analysis of this dataset, in the same format as Homework 2. Research questions: Use the data to describe the effect of elevation on percentage of damaged forest. What is the role of region (North or South) in this analysis? Is the relationship between elevation and percentage of damaged forest the same for the North and South, or does it change according to region? For whatever model you choose, give confidence intervals for the slope in both regions. Also give confidence intervals for the percent of damaged trees (or some transformation of percent damaged) at an elevation of 1200 m for the North and the South.
- Find a model that best describes the relationship and interpret its meaning for the North and South.
- You'll need to examine residual plots (residuals vs. fitted and normal QQ plots) at each step as you investigate the relationship.
- Coded scatterplots will be very helpful in choosing a model. To code a point in the North as "N", use the script we have used before, with this modification: points(Elevation[Location==1], Damage[Location==1], pch="N")
- Coded residual plots may also be helpful. You can save the residuals and fitted values from a fitted model and plot the residuals against the fitted values by making slight adjustments to the commands for coded scatterplots (replace X1 and Y with your fitted values and residuals, respectively, and again code by Location).
- As you write up your exploratory data analysis and statistical analysis section, describe the features of the data that lead to your modeling choice. The majority of the statistical analysis section should focus on interpretations of the model you have chosen (and answers to the specific questions above).
- In order to get confidence intervals for both slopes and for the percent damaged at 1200 m, you will need to think about:
  - Recentering the data.
  - Recoding the data. For example, in a separate lines model, letting "Location=0" mean "North" will give you all you need to do the necessary calculations for the North. If you instead recode so that "Location=0" means "South", you'll get all you need to do the necessary calculations for the South.
- You will make the most efficient use of your time by working on this problem incrementally and spreading your work and discussion with TAs over a few days. Spend some time exploring the data, models, and transformations.
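The recentering and recoding hints above can be sketched in R as follows. This is only an illustration on simulated data (not spruce.txt), and it uses untransformed percent damage purely for simplicity; your own analysis may call for a transformation. The names ElevC and LocN0 are hypothetical helpers introduced here. Location, Elevation and Damage follow the naming in the points() command above.

```r
# Simulated stand-in for spruce.txt (64 sites, 8 southern); NOT the real data.
set.seed(2)
Location  <- rep(c(0, 1), times = c(8, 56))   # 0 = South, 1 = North
Elevation <- runif(64, 700, 1500)             # meters (simulated)
Damage    <- 20 + (0.01 + 0.03 * Location) * (Elevation - 700) + rnorm(64, sd = 5)

# Recenter: with ElevC = 0 at 1200 m, the intercept is the mean response at 1200 m.
ElevC <- Elevation - 1200

# Separate lines model with South as the reference group (Location = 0).
fit.S <- lm(Damage ~ ElevC * Location)

# Recode so that North is the reference group, then refit the same model.
LocN0 <- 1 - Location                         # 0 = North, 1 = South
fit.N <- lm(Damage ~ ElevC * LocN0)

# In each fit, "(Intercept)" is the reference region's mean at 1200 m
# and "ElevC" is the reference region's slope.
ci.S <- confint(fit.S)[c("(Intercept)", "ElevC"), ]
ci.N <- confint(fit.N)[c("(Intercept)", "ElevC"), ]

print(ci.S)   # South: mean at 1200 m, and slope
print(ci.N)   # North: mean at 1200 m, and slope
```

Both fits are the same model written two ways, so the fitted values and residuals are identical. Only the coefficients that software reports directly, with their standard errors, change. That is why recoding saves you from combining standard errors by hand.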
Brief analysis plan for your project.
- Please read this information on poster presentations.
- Directions for analysis plan. Submit a 1-2 page analysis plan (separately from the other part of the homework). It is worth 5% toward your project grade, and is intended for the TAs and me to make sure you have a project that is appropriate for the class. The plan will have the following components:
  - Project title and group members
  - Description of the data you will use. Where found? How collected? How does the data relate to the research question? What role will each variable play in exploring the general research question? Give the outcome (dependent, response, Y) and predictor (independent, X) variables you will use to answer the questions.
    You'll submit a table summarizing the variables of interest. Column 1: Variable Name. Column 2: Indicate whether continuous, discrete, categorical. Column 3: units of the variable. Column 4: Number of observations for this variable.
  - Basic features of the data you will use. Include at least a couple of scatterplots showing general features of the data. Pairs plots are OK to submit here.
  - The general questions you will answer, and hypothesized answers (i.e. what do you expect to see?). What results from these specific statistical methods are needed to support your hypothesized answer?
  - The statistical method(s) that you will use to help answer the question.
  - If the data were not provided by me, attach a copy of the dataset with labeled variables, or provide a link to the web address where the data is located. If your dataset is large, handing in a subset of observations is acceptable.
  The Duke Honor Code applies in our course and to these projects. It is assumed that you will not collaborate with students who may have worked with the same datasets in past regression courses. We will compare current posters to past posters if Honor Code violations are suspected.