Homework 1

Problem 3

Problem 4

The two streams with the largest residuals are Upper Keyup and Orcutt. Each residual equals the actual value minus the predicted value. For Upper Keyup, for example, the residual is 1 - 4.0287 = -3.0287.
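The residual arithmetic can be checked with a short Python sketch. The function name residual is an illustrative choice, not something from the course printout; the two numbers are the Upper Keyup values quoted above.

```python
def residual(actual, predicted):
    """Residual = observed value minus value predicted by the fitted line."""
    return actual - predicted

# Upper Keyup: 1 species observed, 4.0287 predicted (values from the answer above).
upper_keyup = residual(1, 4.0287)
print(round(upper_keyup, 4))
```

A negative residual means the stream has fewer species than the regression line predicts for its pH.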

Problem 5

Use the printout giving the distribution of the residuals for this problem (labeled R_Fish). Y-bar, which is nearly zero, is less than the median (0.0887), meaning the residuals have a slight negative skew. The standard deviation (1.7982) is greater than IQR/1.35 (1.15), meaning the tails are heavier than those of a normal distribution.
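The two comparisons used here (mean against median for skew, standard deviation against IQR/1.35 for tail weight) can be sketched in Python as follows. The sample values are hypothetical and chosen only to illustrate the checks; they are not the R_Fish residuals. The function name shape_summary is likewise an assumption, and quartile conventions differ between packages, so the IQR may not match a given printout exactly.

```python
import statistics

def shape_summary(residuals):
    """Return the quantities needed for the two rule-of-thumb shape checks."""
    q1, _, q3 = statistics.quantiles(residuals, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(residuals),
        "median": statistics.median(residuals),
        "sd": statistics.stdev(residuals),      # sample standard deviation
        "iqr_over_1_35": (q3 - q1) / 1.35,      # close to the SD for normal data
    }

# Hypothetical residuals, NOT the homework data.
sample = [-6.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 4.5]
s = shape_summary(sample)

# Mean below median suggests negative skew.
print("negative skew:", s["mean"] < s["median"])
# SD above IQR/1.35 suggests heavier-than-normal tails.
print("heavy tails:", s["sd"] > s["iqr_over_1_35"])
```

For a normal distribution the IQR is about 1.35 standard deviations, which is why IQR/1.35 serves as the comparison value.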

Problem 6

The two unusual data points are Wilder and Templeton; each has a very low pH and no fish species. A linear regression of fish species on average summer pH without these two points gives the model equation:
	FISH = -13.86 + 3.138(SUMMER pH)
This equation is a bit different from the model equation for the full data set, with a gentler slope and a higher Y-intercept. The model explains less of the total variation in the data (R-squared falls from 0.48 to 0.12), and the F-test of the slope (in the ANOVA table) shows that the slope is no longer significant at the 0.05 level (p > 0.3). Therefore, without these two points, the data do not show a clear linear relationship between number of fish species and stream pH.
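The with-and-without comparison described above can be sketched in Python. The data below are hypothetical, invented only to show the mechanics of refitting after dropping two low-pH, zero-species points; they are not the stream data, and the helper name fit_line is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical (pH, species) pairs; the first two mimic low-pH streams with no fish.
ph = [4.5, 4.7, 6.5, 6.7, 6.9, 7.1, 7.3, 7.5]
fish = [0, 0, 5, 3, 6, 4, 7, 4]

full = fit_line(ph, fish)
reduced = fit_line(ph[2:], fish[2:])  # drop the two low-pH points

for label, (a, b, r2) in (("full", full), ("reduced", reduced)):
    print(f"{label}: FISH = {a:.2f} + {b:.3f}(pH), R-squared = {r2:.2f}")
```

In this made-up example, as in the homework data, the slope flattens and R-squared drops once the two low-pH points are removed, which shows how strongly a fit can depend on a few points at the edge of the x range.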

The decision to include or exclude these points is somewhat subjective, but the data do provide some evidence for making this decision. MOST IMPORTANTLY, the change in the strength of association between the variables, observed above, does NOT by itself justify either retaining the two points in the model or rejecting them. There are two possible scenarios:

1) There is no clear linear relationship between fish number and pH in the data and the two outlying points have undue influence on the regression fit, creating an artificially strong relationship.

2) There is a real linear relationship between pH and number of fish, which is only apparent when the two low pH data points are included.

We don't have strong evidence to distinguish between these options, but evidence in favor of (1) includes the following: outlying data points often have unusual influence on regression equations; the two streams are very different from the other streams, suggesting that the relationship between the variables may be different at this low pH; and the residuals appear to be a bit better distributed without the two points. Arguments in favor of (2) center on the idea that there are very few data points and, without an explicit reason to reject the points, we might as well include them. Whatever the decision, the best answers suggested that the sensitivity of the test to these two unusual data points indicates that we must use caution in interpreting this model.