Sta 242 / Env 255

Homework 1

Due Thursday, January 31, in class





  1. Dissolved oxygen analysis measures the amount of gaseous oxygen (O₂) dissolved in an aqueous solution. Dissolved oxygen can range from 0 to 18 parts per million (ppm), but most natural water systems require 5-6 ppm to support a diverse population. As dissolved oxygen levels in water drop below 5.0 ppm, aquatic life is put under stress.

    Ten replicate water samples were taken at each of four locations in a river to determine whether the quantity of dissolved oxygen varied from one location to another (the higher the level of pollution, the lower the dissolved oxygen reading).

    Location 1 was adjacent to the wastewater discharge point for a certain industrial plant. Locations 2, 3, and 4 were selected at points 10, 20, and 30 miles downstream from this discharge point. The resulting data appear in water.txt, where Column 1 is the dissolved oxygen concentration in ppm (I'll call it "DO"), and Column 2 is the location number.

    During the semester, you will be asked to write half-page summaries of unstructured data problems, as well as do a take-home midterm and project with a similar format. Below, you will step through the components of a typical writeup for this course. For this problem, put all plots on one page, and label it "Problem 1 Plots".

    Save the dataset to your computer. Go to "File" and "Import Data" and "From File". Then select the directory that the file is in and select the format under "Files of type" by choosing "ASCII". Click on the column name, "V1", which is the default. Type in an appropriate variable name. Save the dataset. I'll call it "water" from here on.
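
Outside S-Plus, the same import step can be sketched in Python. This is only an illustration: it assumes the file holds two whitespace-separated columns with no header row (consistent with the default column name "V1"), the function name `parse_water` is hypothetical, and the sample values are made up rather than taken from water.txt.

```python
def parse_water(text):
    """Parse lines of 'DO location' into a list of (float, int) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:      # skip blank or malformed lines
            continue
        rows.append((float(fields[0]), int(fields[1])))
    return rows

# Made-up sample in the assumed layout (NOT the real water.txt values).
sample = """\
5.9 1
6.1 1
6.3 2
6.6 2
"""
water = parse_water(sample)
print(water[0], len(water))
```

To read the real file, something like `parse_water(open("water.txt").read())` should work, assuming the file sits in the working directory.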

    1. Introduction: What is the research question? How are the data collected? Refer to Display 1.5 to describe the study design.

    2. Create a table with sample sizes, group means, group medians, standard deviations and interquartile ranges for each group, as well as the 5th and 25th percentiles for each group. Compute the overall mean. (Why might the lower percentiles be of interest?)

      Go to "Statistics" - "Data Summaries" - "Summary Statistics". Choose the variable name you would like to summarize. Click on the tab called "Statistics" and check these boxes: "mean", "confidence interval for mean", "1st quartile", "median". To get the 5th percentile, go to the command line (go to "Windows" and "Command Line") and type at the ">" prompt:

      attach(water)        # make the columns of "water" available by name

      quantile(DO, 0.05)   # 5th percentile of DO, pooled over all locations
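
As a cross-check on the menu output, the per-group table can also be sketched in Python with the standard library. The readings below are made up for illustration (not the water.txt data), and the percentile rule used here (`method="inclusive"`) is an assumption; other packages may interpolate differently and give slightly different percentiles.

```python
import statistics as st

def group_summary(values):
    """Summary statistics for one group of measurements."""
    # n=20 gives cut points at 5%, 10%, ..., 95%.
    q = st.quantiles(values, n=20, method="inclusive")
    q1, q3 = q[4], q[14]               # 25th and 75th percentiles
    return {
        "n": len(values),
        "mean": st.mean(values),
        "median": st.median(values),
        "sd": st.stdev(values),        # sample SD (divisor n - 1)
        "iqr": q3 - q1,
        "p05": q[0],
        "p25": q1,
    }

# Made-up readings keyed by location number (NOT the real water.txt data).
do_by_location = {
    1: [4.8, 5.2, 5.0, 4.7, 5.3],
    2: [6.0, 6.4, 5.9, 6.2, 6.5],
}
for loc, values in sorted(do_by_location.items()):
    print(loc, group_summary(values))

all_values = [v for values in do_by_location.values() for v in values]
print("overall mean:", st.mean(all_values))
```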

    3. Make side-by-side boxplots of the four groups. Label axes with units.

    4. Make normal probability plots of each of the four groups. Note any outliers or systematic departures from normality. (See the newsgroup post on this.)
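
To see what goes into a normal probability plot, the sketch below pairs each sorted observation with the standard normal quantile expected at its rank. It assumes plotting positions of (i - 0.5) / n, which is one common convention; S-Plus may use a slightly different one. The readings are made up for illustration.

```python
from statistics import NormalDist

def normal_scores(values):
    """Return (theoretical quantile, observed value) pairs for a normal
    probability plot, using plotting positions (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()          # mean 0, SD 1
    return [(std_normal.inv_cdf((i - 0.5) / n), y)
            for i, y in enumerate(ordered, start=1)]

# Made-up readings for one location (NOT the real water.txt data).
readings = [5.0, 4.7, 5.3, 4.8, 5.2]
for z, y in normal_scores(readings):
    print(f"{z:6.2f}  {y:4.1f}")
```

Plotting the second column against the first gives the normal probability plot; points that fall close to a straight line are consistent with a normal distribution.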

    5. Exploratory Data Analysis Section: Write a few sentences summarizing the features of the data. This means: Refer to the components of the summary statistics and graphs above to make your points. How would you describe the differences in means and the differences in variability between the four groups? Are there any unusual observations? Which ones are they? Do the data appear normally distributed or is your sample size too small to definitively determine this? Can you make a preliminary conclusion about the research question based on your exploratory data analysis?

    6. Statistical Analysis Section: Perform a one-way ANOVA to determine whether there is sufficient evidence that values for mean dissolved oxygen content differ significantly among the four locations.

      1. Write out the null and alternative hypotheses using statistical notation. Then write them in words, in terms of the problem.

      2. Construct an analysis of variance table for the data analogous to Display 5.9 on p. 121. To run the one-way ANOVA, go to Statistics and ANOVA and Fixed Effects.
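
The entries of the ANOVA table can be checked by hand. The sketch below computes the between-group and within-group sums of squares, their degrees of freedom, the mean squares, and the F-statistic for a list of groups. The data are made up for illustration, the function name is hypothetical, and the p-value is left out because it needs an F table or statistical software.

```python
def one_way_anova(groups):
    """Return the pieces of a one-way ANOVA table for a list of groups."""
    all_values = [y for g in groups for y in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]

    # Between groups: group means around the grand mean, weighted by size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within groups: observations around their own group mean.
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, means) for y in g)

    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ss_within": ss_within, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

# Made-up data: three groups of three observations.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in table.items():
    print(f"{name:11s} {value:8.3f}")
```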

      3. Summarize the result of the test in a couple of sentences, giving the F-statistic, its p-value and your interpretation of the results. To summarize your test results, use the format of the Case Study reports in Chapter 5 (Section 5.1). Make sure you write your conclusion in terms of the problem. (Also see the solutions to review problems for an example.)

    7. How large a difference in mean dissolved oxygen content is seen between location 2 and location 4? Give a confidence interval. (Review Display 5.6, p. 116 of Sleuth.) Summarize your interval in terms of the problem, as in the Case Studies in Chapter 5.
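
As a reminder of the general form, and assuming the interval is built from the standard deviation pooled over all I = 4 groups (n = 40 observations, so n - I = 36 degrees of freedom), the interval can be written as below. The notation is a generic sketch and should be checked against the display in Sleuth.

```latex
(\bar{Y}_2 - \bar{Y}_4) \;\pm\; t_{n-I}(1-\alpha/2)\; s_p \sqrt{\frac{1}{n_2} + \frac{1}{n_4}},
\qquad
s_p^2 = \frac{\sum_{i=1}^{I} (n_i - 1)\, s_i^2}{n - I}
```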

    8. Plot the residuals of the ANOVA against the estimated means. (Review Section 5.4.2 of Sleuth and Conceptual Exercise #7 on page 135.) Note that this plot might be put in the appendix of a report. Summarize what this plot tells you about the fit of the model. How to do it: Under Statistics and ANOVA and Fixed Effects, enter dependent variable and independent variable. Go to the "Plot" Tab in the ANOVA menu and click "Residuals vs. Fit". This will give a plot analogous to that on page 126 of Sleuth, Display 5.13. Note that under "Options", S-Plus will label your plot with the observation numbers for a number of extreme points that you specify.

    9. Scope of Inference: The final step is to comment on the scope of inference. Again refer to the Case Studies. You should also examine how well the model meets each of the four ANOVA assumptions, and comment on any problems you see. In particular, critique the assumption of independence between groups in this sampling scheme. Suggest a design that would address your concerns.
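
For a one-way ANOVA the fitted value of each observation is its group mean, so the points on a residuals-versus-fit plot can be computed directly. The sketch below does this for made-up data; the function name is hypothetical.

```python
def anova_residuals(groups):
    """Return (fitted, residual) pairs; the fitted value is the group mean."""
    pairs = []
    for g in groups:
        mean = sum(g) / len(g)
        pairs.extend((mean, y - mean) for y in g)
    return pairs

# Made-up data: two groups of three observations.
for fitted, resid in anova_residuals([[1, 2, 3], [5, 6, 7]]):
    print(f"fitted = {fitted:4.1f}   residual = {resid:5.1f}")
```

Each group forms a vertical strip of points at its mean, so differences in spread between strips show up as differences in strip height.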

  2. Please read the article by Munoz-Perez et al., "Cost of Beach Maintenance in the Gulf of Cadiz (SW Spain)". (You can also get a copy via Duke e-Journals, www.lib.duke.edu.) We will use this article to demonstrate some concepts in regression. We will focus on the analysis surrounding the plot on page 151. The data we will use are in cadiz.txt. For this problem, put all plots on a single page, labeling it "Problem 2 plots".

    1. What research questions are being answered by the analysis in Figure 4? How are these questions answered in the conclusions section?

    2. How were the data for "cost" and "nourishment volume" collected? Are the data based on a random sample or the population? Based on this, can you make conclusions about the scope of inference? How do the authors address the scope of inference in the paper?

    3. Do the data represent a population or a sample? Justify. (See the newsgroup post.)

    4. Reproduce Figure 4 in the paper by creating a coded scatterplot (without the regression lines). This will also be described in lab.

    5. On page 151 in the second column, the authors fit a regression line to the portion of the data corresponding to trail suction dredgers for which the nourishments are less than 600,000 cubic meters of sand. Note that no justification of this cutoff is given, and according to Dr. O.H. Pilkey in EOS, there is no scientific, policy or even equipment-related reason to split trail suction dredgers into these 2 categories. (Did the authors perform data snooping?) These data are in trail.txt. Produce a regression line plot of these data. Clearly label axes.

    6. Now fit the regression for part 5. Write out the fitted regression line in the format of the box above Section 7.4 (on page 180).
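
The least-squares estimates reported by the software can be reproduced from sums of squares and cross-products. The sketch below uses made-up numbers (not the trail.txt data), and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(x, y)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
```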

    7. Give a 95% confidence interval for the slope of the line in part 6. Based on your plot in part 5 and your regression results, do you agree with the authors' finding that "Though the costs are spread out, a simple linear fit by the least squares method shows a decreasing tendency"? (Personally, I wouldn't write that the costs are "spread out" -- I'd mention that the distribution of costs is skewed, with a right tail. Also, the spread of costs as a function of volume is what is of interest in the plot, not the costs themselves.)

    8. Now let's check assumptions. Produce a plot of residuals versus fitted values, as well as a normal probability plot of residuals. Are the assumptions of the regression met? State each one and comment. (See the newsgroup posts on residual plots and on normal probability plots.)
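
A confidence interval for the slope has the form estimate ± (t quantile) × (standard error). The sketch below computes the standard error from the residuals and takes the t quantile as an input looked up from a table, since the Python standard library has no t distribution. The data are made up (not from trail.txt) and the function name is hypothetical.

```python
import math

def slope_interval(x, y, t_crit):
    """Confidence interval for the least-squares slope.

    t_crit is the t quantile with n - 2 degrees of freedom, supplied by
    the caller (for a 95% interval, the 0.975 quantile).
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation on n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))
    se_b1 = sigma_hat / math.sqrt(sxx)
    return b1, se_b1, (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Made-up data; 3.182 is the 0.975 quantile of t with 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, se, (low, high) = slope_interval(x, y, t_crit=3.182)
print(f"slope = {slope:.3f}, SE = {se:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```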

    We will continue our examination of this paper in the next homework.

  3. Burning fuels produce air pollutants that return to earth as acid rain, affecting wildlife in lakes and streams. Declines in aquatic life have occurred downwind of industrial regions in parts of Europe, the Rocky Mountains, and the northeastern U.S.

    The data come from 15 tributaries of Millers River in north-central Massachusetts. For each stream, we know the average pH and the number of fish species observed during the summer of 1983. pH measures acidity; the more acidic the stream, the lower its pH. Pure water has a pH of 7.0; vinegar is about 3.0.

    The data in acidfish.txt are taken from D. Halliwell (Committee on Monitoring and Assessment of Trends in Acid Deposition, 1986) and consist of:

    We wish to explore the relationship between acidity and fish species.

    1. Calculate the linear correlation coefficient, r. (See the S-Plus directions.) A range of pH 6.5 to pH 8.2 is optimal for most organisms. Suppose an EPA scientist wishes to rescale the data relative to a lower bound for pH. If 6.5 is subtracted from each of the average summer pH values, will the correlation coefficient change? Why or why not?
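
For reference, the sample correlation coefficient is defined as below, with X the average pH and Y the number of fish species. Working through how each term responds to a shift in X is one way to approach the rescaling question.

```latex
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \;\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
```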

    2. Produce a regression line plot of fish species (Y) versus average pH (X).

    3. Regress fish species (Y) on average pH (X). Write out the fitted regression equation in the format of the box above Section 7.4 (on page 180). In one sentence each, interpret the slope, intercept and R². (Look under "Example - Big Bang" on p. 173 for an example of such interpretations.)

    4. Test whether there is evidence of a linear association between number of fish species and average summer pH. Use alpha=0.05. Write out hypotheses, rejection region, and test statistic. Also, give the one-sided and two-sided p-values and comment on which is more appropriate for this example.

    5. Construct by hand a 90% confidence interval for the mean number of fish species in streams with pH=6.0 and with pH=8.0.
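
For the hand calculation, a generic form of the interval for the mean response at X = X₀ is sketched below. With n = 15 streams the t quantile has n - 2 = 13 degrees of freedom, and a 90% interval uses the 0.95 quantile. The notation is a sketch and should be checked against the text.

```latex
\hat{\mu}\{Y \mid X_0\} \;\pm\; t_{n-2}(0.95)\;\hat{\sigma}
\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)\, s_X^2}},
\qquad
\hat{\mu}\{Y \mid X_0\} = \hat{\beta}_0 + \hat{\beta}_1 X_0
```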

    6. Produce a plot of the confidence intervals for the mean response at each level of the x-variable. Under "Graph", "2D Plot", "Fit-Linear Least Squares", specify the x-Column and the y-column. Then click on the "By Conf Bound" tab. Choose "Confidence 0.90" at the bottom. Then choose a line style (other than "None"), color and width. Click on OK. For this homework, all plots go on a single page.

      For any regression, we would verify the assumptions on the residuals of the regression by creating 2 plots: a plot of residuals vs. fitted values, and a qq normal plot of residuals. You would also think about the assumptions and discuss how well they are met in this example. Since you did this in Problem 2, you do not have to turn this in. However, such residual analyses are critical to the analysis of the data.

    7. Calculate by hand 90% prediction intervals for the number of fish species in a stream with pH=6.0 and with pH=8.0.
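
A prediction interval uses the same center as the interval for the mean but a larger standard error, because it adds the variability of a single new observation (the extra "1 +" under the square root). The sketch below shows the arithmetic on made-up data (not acidfish.txt); the t quantile is passed in from a table and the function name is hypothetical.

```python
import math

def prediction_interval(x, y, x0, t_crit):
    """Prediction interval for a new response at x0 from a simple
    linear regression; t_crit is the t quantile with n - 2 df."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))
    # Standard error of prediction: note the leading 1 for the new case.
    se_pred = sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    center = b0 + b1 * x0
    return center - t_crit * se_pred, center + t_crit * se_pred

# Made-up (pH, species) data; 2.353 is the 0.95 quantile of t with 3 df.
ph = [5.5, 6.0, 6.5, 7.0, 7.5]
species = [4, 7, 9, 12, 13]
low, high = prediction_interval(ph, species, x0=6.5, t_crit=2.353)
print(f"90% prediction interval at pH 6.5: ({low:.2f}, {high:.2f})")
```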

    8. Produce a plot of the prediction intervals for the number of fish species at each level of the x-variable. (See the directions.)

    9. Which streams correspond to the two observations with the lowest pH and smallest number of species? Repeat the regression without these two values and create a second regression line plot. Discuss how these cases affect your results. Should they be left in or eliminated in drawing final conclusions about fish and pH?
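
The effect of removing cases can be seen by fitting the line twice and comparing the estimates. The sketch below does this on made-up data (not acidfish.txt), dropping the two cases with the lowest x value; `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up (pH, species) pairs for illustration only.
ph = [5.0, 5.2, 6.5, 7.0, 7.5, 8.0]
species = [2, 3, 10, 11, 12, 13]

# Indices of all cases except the two with the lowest pH.
keep = sorted(range(len(ph)), key=lambda i: ph[i])[2:]
ph_sub = [ph[i] for i in keep]
species_sub = [species[i] for i in keep]

b0_all, b1_all = fit_line(ph, species)
b0_sub, b1_sub = fit_line(ph_sub, species_sub)
print(f"all cases:     y = {b0_all:.2f} + {b1_all:.2f} x")
print(f"two removed:   y = {b0_sub:.2f} + {b1_sub:.2f} x")
```

A large change in slope between the two fits indicates that the removed cases are influential, which is worth reporting either way.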

    10. Does the correlation coefficient give a fair estimate of the strength of the association between number of species and pH? Why or why not?

      • One issue to think about is "ecological correlation," which is summarized here. Look halfway down the web page under the heading "Ecological Correlation" and note the applet you can use to choose variables and look at different datasets.
      • Under what conditions would it be more appropriate to use the median or minimum of pH values as the x-variable?