STA242/ENV255 Homework 1, due Tuesday, January 20th at 12:40pm.

Late homework will not be accepted.

Assignments must be typed and stapled or will not be graded. Mathematical symbols or calculations can be neatly written out, or you can use an equation editor.

You may discuss the problems in HW1 with colleagues, but everything you turn in must be your own.

  1. (5 points) Sleuth, p. 25, problem 25.

  2. (10 points) Data presented in the article "Manganese Intake and Serum Manganese Concentration of Human Milk-Fed and Formula-Fed Infants" (Amer. J. Clinical Nutrition (1984): 872-878) suggest that a simple linear regression model is reasonable for describing the relationship between y = serum manganese (Mn) and x = Mn intake (micrograms per kg per day). Suppose that the true regression line is y = -2 + 1.4x and that sigma = 1.2. Then for a fixed x value, y has a normal distribution with mean (-2 + 1.4x) and standard deviation 1.2.
    1. What is the probability that an infant whose Mn intake is 4.0 will have serum Mn greater than 5?
    2. Approximately what proportion of infants whose Mn intake is 5 will have a serum Mn greater than 5? Less than 3.8?
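
As an illustration of how such probabilities follow from the model, the sketch below computes P(y > cutoff) at a given intake x using Python's standard-library `statistics.NormalDist`. The function name `prob_serum_above` and the example values (intake 3.0, cutoff 3.0) are assumptions chosen for illustration; they are not the values asked about in this problem.

```python
from statistics import NormalDist

# True regression line and sigma as stated in problem 2.
INTERCEPT = -2.0
SLOPE = 1.4
SIGMA = 1.2


def prob_serum_above(x, cutoff):
    """P(y > cutoff) when y ~ Normal(INTERCEPT + SLOPE * x, SIGMA)."""
    mean_y = INTERCEPT + SLOPE * x
    return 1.0 - NormalDist(mu=mean_y, sigma=SIGMA).cdf(cutoff)


# Hypothetical example: intake 3.0 gives mean 2.2, so z = (3.0 - 2.2) / 1.2.
print(round(prob_serum_above(3.0, 3.0), 4))
```

A "less than" probability can be obtained the same way from the `cdf` value directly, without subtracting from 1.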

  3. (15 points) A study of cost-effectiveness in public education reported that for a sample of 44 school districts, a regression of y = average SAT score on x = expenditure in thousands of dollars per pupil gave a slope of 15.0 with standard error 5.3.
    1. For these school districts, do expenditures appear to be associated with average SAT scores? Perform a hypothesis test, giving hypotheses, test statistic and p-value. Give your result in a sentence.
    2. Calculate and interpret a 95% confidence interval for the slope in a sentence.
    3. A school board member claims that a $1000 increase in expenditures per child should be associated with at least a 10-point increase in SAT scores. Write out the hypotheses to test this claim, then give and interpret the p-value.
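
The mechanics of inference on a slope from a reported estimate and standard error can be sketched as follows. This is a minimal sketch, not a worked solution: the numbers (slope 2.0, standard error 0.5, n = 30) are hypothetical, the function names are illustrative, and the critical value 2.048 is the 0.975 quantile of a t distribution with 28 degrees of freedom as read from a t table. Python's standard library has no t distribution, so the p-value itself would still come from a table or statistical software.

```python
def slope_t_statistic(estimate, std_error, null_value=0.0):
    """t statistic for H0: slope = null_value."""
    return (estimate - null_value) / std_error


def slope_confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return (estimate - half_width, estimate + half_width)


# Hypothetical numbers: n = 30 observations, so df = n - 2 = 28.
b1, se = 2.0, 0.5
t_crit_28 = 2.048  # 0.975 quantile of t with 28 df, taken from a t table

t_stat = slope_t_statistic(b1, se)
ci = slope_confidence_interval(b1, se, t_crit_28)
print(t_stat, ci)
```

Passing a nonzero `null_value` gives the statistic for a claim about a specific slope rather than for "no association".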

  4. (70 points) Burning fuels produce air pollutants that return to earth as acid rain, affecting wildlife in lakes and streams. Declines in aquatic life have occurred downwind of industrial regions in parts of Europe, the Rocky Mountains, and the northeastern U.S.

    The following table lists data from 15 tributaries of Millers River in north-central Massachusetts. For each stream, we know the average pH and the number of fish species observed during the summer of 1983. pH measures acidity; the more acidic the stream, the lower its pH. Pure water has a pH of 7.0; vinegar is about 3.0.

    The data in acidfish.txt are taken from D. Halliwell (Committee on Monitoring and Assessment Trends in Acid Deposition, 1986) and consist of:

    We wish to explore the relationship between acidity and fish species.

    Start by creating a directory in your Z: drive specifically for this dataset and homework problem, say HW1.acidfish.

    Import the data into Splus.

    1. Use Splus to calculate the linear correlation coefficient, r. A range of pH 6.5 to pH 8.2 is optimal for most organisms. Suppose an EPA scientist wishes to rescale the data relative to a lower bound for pH. If 6.5 is subtracted from each of the average summer pH values, will the correlation coefficient change? Why or why not?
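
One way to explore the rescaling question numerically, outside Splus, is to compute r from its definition before and after subtracting the lower bound. The sketch below does this in Python; the pH and species values are made up for illustration and are not the acidfish.txt data, and the function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical values, NOT the homework data.
ph = [5.5, 6.0, 6.4, 6.8, 7.1, 7.4]
species = [3, 5, 6, 9, 8, 12]

# Rescale pH relative to the lower bound of 6.5 mentioned in the problem.
ph_shifted = [value - 6.5 for value in ph]

r_original = pearson_r(ph, species)
r_shifted = pearson_r(ph_shifted, species)
print(r_original, r_shifted)
```

Comparing the two printed values, and looking at where the shift enters the formula, is a useful check on the written explanation.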

    2. Produce a regression line plot of fish species (Y) versus average pH (X). For this homework, all plots go on a single page.

    3. Use Splus to regress fish species (Y) on average pH (X). Write out the fitted regression equation in the format of the box before Section 7.4 (bottom of page 185). In one sentence each, interpret the slope, intercept and R^2. (Look under "Case Studies" "Summary of Statistical Findings" and "Scope of Inference" for examples of interpretations.)
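
For checking Splus output by hand, the least-squares slope, intercept and R^2 can be computed directly from sums of squares. The sketch below is one way to do that in Python; the function name `fit_line` is illustrative and the data are hypothetical, not the acidfish.txt values.

```python
def fit_line(xs, ys):
    """Least-squares fit of y on x; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - SSE / SST, the share of variation in y explained by the line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - sse / sst


# Hypothetical values, NOT the homework data.
ph = [5.5, 6.0, 6.4, 6.8, 7.1, 7.4]
species = [3, 5, 6, 9, 8, 12]

b0, b1, r2 = fit_line(ph, species)
print(b0, b1, r2)
```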

    4. Test whether there is evidence of a linear association between number of fish species and average summer pH. Use alpha=0.05. Write out hypotheses, rejection region, and test statistic. Also, give the one-sided and two-sided p-values and comment on which is more appropriate for this example.

    5. Construct by hand a 95% confidence interval for the slope. How does this confirm the two-sided test result of the previous problem?

    6. Verify the assumptions on the residuals of the regression by creating two plots: a plot of residuals vs. fitted values, and a normal Q-Q plot of residuals. Write out the assumptions and discuss how well they are met in this example. For this homework, all plots go on a single page.
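
To make clear what a normal Q-Q plot displays, the sketch below builds its coordinates: sorted residuals paired with standard normal quantiles. The residuals are hypothetical, the function name `normal_qq_points` is illustrative, and the plotting positions (i - 0.5) / n are one common convention that may differ slightly from what Splus uses.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n for i = 1..n; an assumed convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, NOT taken from the homework regression.
example_residuals = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.6, 0.7]

for quantile, residual in normal_qq_points(example_residuals):
    print(round(quantile, 3), residual)
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted.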

    7. Which streams correspond to the two observations with the lowest pH and smallest number of species? Repeat the regression without these two values and create a second regression line plot. Discuss how these cases affect your results. Should they be left in or eliminated in drawing final conclusions about fish and pH? The second regression line plot goes on the single page of plots for this homework.

    8. Does the correlation coefficient give a fair estimate of the strength of the association between number of species and pH? Why or why not?

      • One issue to think about is "ecological correlation," which is summarized here. Look halfway down the web page under the heading "Ecological Correlation" and note the applet you can use to choose variables and look at different datasets.
      • Under what conditions would it be more appropriate to use the median or minimum of pH values as the x-variable?


Last modified: Mon Jan 12 15:31:25 EST 2004