Sta 242 / Env 255

Homework 2

Due Tuesday, January 30, at class



  1. Burning fuels produce air pollutants that return to earth as acid rain, affecting wildlife in lakes and streams. Declines in aquatic life have occurred downwind of industrial regions in parts of Europe, the Rocky Mountains, and the northeastern U.S.

    The following table lists data from 15 tributaries of Millers River in north-central Massachusetts. For each stream, we know the average pH and the number of fish species observed during the summer of 1983. pH measures acidity; the more acidic the stream, the lower its pH. Pure water has a pH of 7.0; vinegar is about 3.0.

    The data in acidfish.txt are taken from D. Halliwell (Committee on Monitoring and Assessment Trends in Acid Deposition, 1986) and consist of:

    We wish to explore the relationship between acidity and fish species.

    1. Calculate the linear correlation coefficient, r. A range of pH 6.5 to pH 8.2 is optimal for most organisms. Suppose an EPA scientist wishes to rescale the data relative to a lower bound for pH. If 6.5 is subtracted from each of the average summer pH values, will the correlation coefficient change? Why or why not?

    2. Produce a regression line plot of fish species (Y) versus average pH (X). For this homework, all plots go on a single page.

    3. Regress fish species (Y) on average pH (X). Write out the fitted regression equation in the format of the box above Section 7.4 (on page 180). In one sentence each, interpret the slope, intercept, and R^2. (Look under "Example - Big Bang" on p. 173 for an example of such interpretations.)

    4. Test whether there is evidence of a linear association between number of fish species and average summer pH. Use alpha=0.05. Write out hypotheses, rejection region, and test statistic. Also, give the one-sided and two-sided p-values and comment on which is more appropriate for this example.

    5. Construct by hand a 95% confidence interval for the slope. How does this confirm the two-sided test result of the previous problem?

    6. Construct by hand a 90% confidence interval for the mean number of fish species in streams with pH=6.0 and with pH=8.0.

      Produce a plot of the confidence intervals for the mean response at each level of the x-variable. Under "Graph", "2D Plot", "Fit-Linear Least Squares", specify the x-Column and the y-column. Then click on the "By Conf Bound" tab. Choose "Confidence 0.90" at the bottom. Then choose a line style (other than "None"), color and width. Click on OK. For this homework, all plots go on a single page.

    7. Calculate by hand 90% prediction intervals for the number of fish species in a stream with pH=6.0 and with pH=8.0.

    8. Verify the assumptions on the residuals of the regression by creating 2 plots: a plot of residuals vs. fitted values, and a qq normal plot of residuals. Write out the assumptions and discuss how well they are met in this example. For this homework, all plots go on a single page.

    9. Which streams correspond to the two observations with the lowest pH and smallest number of species? Repeat the regression without these two values and create a second regression line plot. Discuss how these cases affect your results. Should they be left in or eliminated in drawing final conclusions about fish and pH? The second regression line plot goes on the single page of plots for this homework.

    10. Does the correlation coefficient give a fair estimate of the strength of the association between number of species and pH? Why or why not?

      • One issue to think about is "ecological correlation," which is summarized here. Look halfway down the web page under the heading "Ecological Correlation" and note the applet you can use to choose variables and look at different datasets. A handout on ecological correlation is also available on CourseInfo under "Assignments" and "HW2".
      Under what conditions would it be more appropriate to use the median or minimum of pH values as the x-variable?
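
The hand calculations in parts 1 through 7 all build on the same few sums. The following Python sketch is one way to check that arithmetic. It is not the course software, and the numbers in `ph` and `species` are made-up placeholders, not the acidfish.txt values. The function names and the `t_crit` argument are illustrative assumptions; the critical values must be read from a t table with n - 2 degrees of freedom.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y on x, returning the pieces used in hand calculations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = syy - slope * sxy               # residual sum of squares
    s = math.sqrt(sse / (n - 2))          # residual standard deviation
    return {
        "n": n, "xbar": xbar, "sxx": sxx,
        "slope": slope, "intercept": intercept,
        "r": sxy / math.sqrt(sxx * syy),  # correlation coefficient
        "s": s, "se_slope": s / math.sqrt(sxx),
    }

def slope_interval(fit, t_crit):
    """Confidence interval for the slope: estimate +/- t_crit * SE(slope)."""
    half = t_crit * fit["se_slope"]
    return fit["slope"] - half, fit["slope"] + half

def interval_at(fit, x0, t_crit, predict=False):
    """Interval for the mean response at x0, or for a new observation if predict=True."""
    yhat = fit["intercept"] + fit["slope"] * x0
    extra = 1.0 if predict else 0.0       # a prediction interval adds 1 under the root
    se = fit["s"] * math.sqrt(extra + 1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"])
    return yhat - t_crit * se, yhat + t_crit * se

# Placeholder data (hypothetical, NOT the homework data set).
ph = [5.5, 6.0, 6.5, 7.0, 7.5]
species = [2, 4, 5, 4, 5]

fit = fit_line(ph, species)
shifted = fit_line([p - 6.5 for p in ph], species)  # pH rescaled relative to 6.5

# With 5 placeholder points there are 3 degrees of freedom; these are
# t-table values for 3 df (0.975 and 0.95 quantiles respectively).
slope_ci = slope_interval(fit, t_crit=3.182)             # 95% interval
mean_ci = interval_at(fit, 6.0, t_crit=2.353)            # 90% interval for the mean
pred_pi = interval_at(fit, 6.0, t_crit=2.353, predict=True)  # 90% prediction interval

print("r original vs shifted:", fit["r"], shifted["r"])
print("slope, intercept, R^2:", fit["slope"], fit["intercept"], fit["r"] ** 2)
print("t statistic for slope:", fit["slope"] / fit["se_slope"])
print("95% CI for slope:", slope_ci)
print("90% CI for mean at pH 6.0:", mean_ci)
print("90% PI at pH 6.0:", pred_pi)
```

For the 15 streams in the homework, the degrees of freedom would be 13, so the critical values passed as `t_crit` would differ from those used with the placeholder data.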

  2. Problem 24, page 140 of Statistical Sleuth. The data are in zinc.txt. This problem is an example of the more unstructured data analysis problems you will see during the semester, particularly on the take-home midterm and on your final project. For this problem, you will produce an analysis of at most one page. As the semester progresses, it will be assumed that you understand how to structure such a write-up of your work.

    Keep your writeup to one page at most. You will hand it in separately, so make sure your name is on it.

    Structure of the Writeup

    1. Clear statement of research question

    2. Description of data. If observational, how were the data collected? What was the experimental design? Mention any sampling issues (independence or possible correlation of data, sample size issues) that could have an impact on the analysis.

    3. Exploratory analysis of data

      • summary statistics giving center and spread of data. These can be reported in a simple table or in a sentence.

      • relevant plots tailored to the question at hand. In this homework problem, a boxplot will suffice to give a sense of how the means differ. One way to make this plot more informative would be to provide the sample sizes for each group (this can be added as text to the boxplot).

      • unique features of the data that could have an impact on the analysis: outliers, shape of distribution, sample size issues.

    4. Statistical analysis. This section includes information on the statistical tests performed, and should be written in a form similar to the "Case Studies - Summary of Statistical Findings" section at the beginning of each chapter in Statistical Sleuth.

      • Statistical modeling/analysis. For this problem, specify a one-way ANOVA model and clearly give the hypotheses being considered. Then perform the planned comparison between the pregnant vegetarian and non-vegetarian groups by forming a confidence interval.

      • Diagnostic checks on the method used. For this problem, a plot of residuals versus fitted values will suffice (see p. 126); you can include the plot in your 1-page writeup or write a sentence summarizing what it shows. For this problem, you should comment on the equal variances assumption of the ANOVA and whether it is met by the data; it may be that the small sample sizes prevent you from answering this question definitively.

      • Sensitivity analysis. Are certain observations exerting significant influence on the analysis? Run the models with and without the points of concern.

    5. Summary of findings and scope of inference. To what extent do our data answer the question asked? What are the limitations of the model you have selected? What advice can you give to decision makers about the problem at hand? Review Section 1.2 of Statistical Sleuth. The "Scope of Inference" sections of the "Case Studies" should also be of use here. Remember that you will not always see perfect "textbook" datasets, and that this particular dataset may be all that is available to answer the question.
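
For the one-way ANOVA and planned comparison described under "Statistical analysis" above, the arithmetic can be sketched in Python as follows. This is only an illustration under stated assumptions. The group labels and values are hypothetical placeholders, not the zinc.txt data. The function name `one_way_anova` is invented for this sketch, and the critical value must be read from a t table with the within-group degrees of freedom.

```python
import math

def one_way_anova(groups):
    """Return group means, the F statistic, the pooled SD, and within-group df."""
    n_total = sum(len(g) for g in groups.values())
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups.values()) / n_total
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (means[name] - grand_mean) ** 2 for name, g in groups.items())
    ssw = sum((v - means[name]) ** 2 for name, g in groups.items() for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ssb / df_between) / (ssw / df_within)
    pooled_sd = math.sqrt(ssw / df_within)
    return means, f_stat, pooled_sd, df_within

# Placeholder groups (hypothetical labels and values, NOT the homework data).
groups = {
    "group_a": [1.0, 2.0, 3.0],
    "group_b": [2.0, 3.0, 4.0],
    "group_c": [5.0, 6.0, 7.0],
}

means, f_stat, pooled_sd, df_within = one_way_anova(groups)

# Planned comparison of two groups, using the pooled SD from all groups.
diff = means["group_a"] - means["group_b"]
se_diff = pooled_sd * math.sqrt(1 / len(groups["group_a"]) + 1 / len(groups["group_b"]))
t_crit = 2.447  # t-table value for a 95% interval with 6 df (placeholder data only)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print("F statistic:", f_stat, "on", len(groups) - 1, "and", df_within, "df")
print("pooled SD:", pooled_sd)
print("95% CI for group_a minus group_b:", ci)
```

Using the pooled standard deviation from all groups in the comparison relies on the equal variances assumption, which is why the diagnostic check on that assumption matters for the interval as well as for the F-test.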