Take-Home Exam

Due in class Tuesday, March 28

This is an exam. This means you may not discuss this problem with anyone, in this class or outside it. If you have any questions, you should ask the instructor or one of the TAs.

Problem

Pollution of waterways is one of the most serious problems facing the world today. Billions of dollars have been spent on cleanup, antipollution laws have been passed, and technological innovations have been sought to prevent pollution; still, the world is probably getting more, not less, polluted. Pollutant levels in various bodies of water are important to study in order to get a handle on the pollution problem. In particular, prediction of future levels based on current characteristics is crucial for determining strategies to address pollution. The file pcb.dat contains data on PCB concentrations (measured in parts per billion) in samples from US bays and estuaries in 1984 and 1985. The variable names for the three columns are Location, pcb84, and pcb85.

Using any of the methods that you know, fit a regression model using the 1984 data to predict PCB levels for 1985, and use it to answer the question: "Does use of the previous year's data help in predicting the next year's PCB level?" A second question of interest that your analysis should address is whether the PCB levels have changed over the year or are staying the same (on average). You may need to consider transforming both variables. Be aware that for some locations the measured values may be below the limit of detection, and so have been recorded as 0. If you are going to use log transformations, a common tactic is to add 1 to all cases. (Others add a small value, 0.1, to the 0's; just document what you do!) Since both variables are measured in the same units, using the same transformations (if needed) on both years' data will ease the interpretation of results. Use any diagnostic tools to ensure that the assumptions for the model are appropriate. If there are outliers or influential cases, refit the model without these cases and see whether the fitted model changes. For example, it is known that in 1984 Boston stopped dumping raw sewage into Boston Harbor and opened a sewage treatment plant, so we may suspect that there is a change in levels for this case.
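As a rough illustration of the "add 1, then take logs" tactic followed by a least-squares fit, here is a minimal Python sketch. The function names `log1p_transform` and `fit_simple_ols` are hypothetical helpers, and the numbers are made-up values for illustration only, not values from pcb.dat. It is one possible starting point under those assumptions, not a required method.

```python
import math

def log1p_transform(values):
    """Add 1 to every value, then take the natural log (a recorded 0 maps to 0)."""
    return [math.log(v + 1.0) for v in values]

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative concentrations (ppb); NOT taken from pcb.dat.
pcb84_demo = [0.0, 5.0, 12.0, 40.0, 150.0]
pcb85_demo = [0.0, 4.0, 15.0, 30.0, 120.0]

# Apply the same transformation to both years, as the problem suggests.
x = log1p_transform(pcb84_demo)
y = log1p_transform(pcb85_demo)

intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Because both variables are on the same log scale, a slope near 1 together with an intercept near 0 would be consistent with levels staying about the same from one year to the next; whether that holds for the real data is for your analysis to determine.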

Write a 1-2 page (max!) summary of your regression model(s), interpreting your analysis and including any important figures that address the questions of interest. Be sure to interpret any important coefficients (or interval estimates) for the model, or other important summary measures. For the lower Chesapeake Bay, include and interpret a prediction interval for future PCB concentrations for 1986 (i.e., use the 1985 PCB level as X rather than the 1984 data that were used to fit the model). If you transform the data, make sure that you include interpretations of the model back in the original units.

In the summary, you should provide the numerical results that are of interest and document what you did to arrive at your model and conclusions. As a guideline, your summary should contain the major numerical summaries (i.e., estimates, p-values, confidence intervals) and conclusions, and should be able to stand on its own; an interested reader should be able to understand what you did and what you concluded without having to turn to the appendix to look up an important number. (Hint: in grading we may not read the appendix at all if the numerical solutions are correct, so don't bury important information there!)

Your summary should be concise and to the point. You do not need to go into detail on directions that you tried but that did not prove successful. Instead, focus on your final results, but make sure the path you took to get to them is clear, i.e., anyone could go back and reconstruct your results based on your presentation.

You should not provide step-by-step details of calculations in the summary, but should include them CLEARLY labeled in an appendix. For example, the step-by-step calculations for the prediction interval for Chesapeake Bay should be contained in the appendix, but you should report only the interval in the summary. The appendix should contain only relevant material and should not exceed 4 pages. You do not need to include everything that you tried, only the material that supports your summary.

In your summary, you may also want to comment on ways to improve upon the analysis and any limitations of the model.
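For the prediction interval and the back-transformation to original units, the following Python sketch shows the standard simple-regression formula, fit ± t · s · sqrt(1 + 1/n + (x0 − x̄)² / Sxx), applied on the log(x + 1) scale. The helper names `prediction_interval` and `back_transform`, the demo values, and the new X value are all hypothetical; the critical value 3.182 is the 97.5th percentile of the t distribution with 3 degrees of freedom, read from a standard table. It assumes a straight-line model with constant error variance.

```python
import math

def prediction_interval(x, y, x_new, t_crit):
    """Return (fit, lower, upper) for a new response at x_new.

    t_crit is the t critical value with n - 2 degrees of freedom,
    supplied by the caller (e.g. from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    # Standard error for predicting a single new observation.
    se_pred = s * math.sqrt(1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx)
    fit = intercept + slope * x_new
    return fit, fit - t_crit * se_pred, fit + t_crit * se_pred

def back_transform(log_value):
    """Invert log(v + 1); clip at 0 because a concentration cannot be negative."""
    return max(0.0, math.exp(log_value) - 1.0)

# Made-up illustrative concentrations (ppb); NOT taken from pcb.dat.
pcb84_demo = [0.0, 5.0, 12.0, 40.0, 150.0]
pcb85_demo = [0.0, 4.0, 15.0, 30.0, 120.0]
x = [math.log(v + 1.0) for v in pcb84_demo]
y = [math.log(v + 1.0) for v in pcb85_demo]

# Hypothetical 1985 level used as the new X to predict 1986.
x_new = math.log(20.0 + 1.0)
t_crit = 3.182  # t(0.975) with n - 2 = 3 degrees of freedom, from a t table

fit, lower, upper = prediction_interval(x, y, x_new, t_crit)
print("log scale:", fit, lower, upper)
print("original units (ppb):", back_transform(lower), back_transform(upper))
```

Because log(v + 1) is a monotone transformation, back-transforming the two endpoints gives an interval for the concentration in parts per billion. The back-transformed point prediction is better read as an estimate of a median than of a mean on the original scale.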