FISH = -13.86 + 3.138(SUMMER pH)
This equation differs somewhat from the model fitted to the full data set, with a gentler slope and a higher Y-intercept. The model explains less of the total variation in the data (R-squared falls from 0.48 to 0.12), and the F-test of the slope (in the ANOVA table) shows that the slope is no longer significant at the 0.05 level (p > 0.3). Therefore, without these two points, the data do not show a clear linear relationship between the number of fish species and stream pH.
The decision to include or exclude these points is somewhat subjective, but the data do provide some evidence for making it. MOST IMPORTANTLY, the change in the strength of association between the variables, observed above, does NOT by itself justify retaining the two points in the model (or rejecting them). There are two possible scenarios:
1) There is no clear linear relationship between fish number and pH in the data and the two outlying points have undue influence on the regression fit, creating an artificially strong relationship.
2) There is a real linear relationship between pH and number of fish, which is only apparent when the two low pH data points are included.
We don't have strong evidence to distinguish between these options, but evidence in favor of (1) includes the following: outlying data points often have unusually large influence on regression equations; the two streams are very different from the other streams, suggesting that the relationship between the variables may be different at this low pH; and the residuals appear somewhat better distributed without the two points. Arguments in favor of (2) center on the idea that there are very few data points and that, without an explicit reason to reject the points, we might as well include them. Whatever the decision, the best answers suggested that the sensitivity of the test to these two unusual data points means we must use caution in interpreting this model.
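The sensitivity check described above can be sketched in a few lines of Python. This is only an illustration: the pH and fish values below are made up, NOT the stream data from this assignment, and the helper name fit_line and the pH cutoff of 6.0 are assumptions chosen for the example. The idea is simply to fit the line twice, with and without the low-pH points, and compare slope and R-squared.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical values for illustration only (not the assignment's data).
# The first two streams play the role of the low-pH outliers.
ph = [4.5, 4.8, 6.6, 6.8, 7.0, 7.1, 7.3, 7.5]
fish = [1, 2, 9, 6, 10, 7, 11, 8]

# Fit using every stream.
full = fit_line(ph, fish)

# Refit after dropping the low-pH streams (cutoff of 6.0 is an assumption).
keep = [i for i, p in enumerate(ph) if p >= 6.0]
reduced = fit_line([ph[i] for i in keep], [fish[i] for i in keep])

for label, (a, b, r2) in (("all points", full), ("without low-pH points", reduced)):
    print(f"{label}: FISH = {a:.2f} + {b:.3f}(pH), R-squared = {r2:.2f}")
```

If the slope and R-squared change sharply between the two fits, as they do for these made-up values, the conclusion rests heavily on the excluded points. As noted above, that sensitivity alone does not say which fit is correct; it only signals that the model should be interpreted with caution.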