each with eleven observations of two variables we'll call X and Y. Look at the following R commands:
> a1 <- read.table("anscombe.1",col.names=c("X","Y"));
> fit <- lm(Y~X,data=a1);
> summary(fit);
> plot(fitted.values(fit), residuals(fit));
This reads the first Anscombe dataset into a new variable "a1", then
fits a linear model (that's what "lm" stands for) to regress Y on X,
prints a summary of the fit, and plots the residuals against the
fitted values. The summary report is:
Call:
lm(formula = Y ~ X, data = a1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
X             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,     Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170
indicating that the regression fit is Y = 3.0 + 0.5 X. The residual
plot is something like:
Res     -                                             *
        -
        -
     1.2+               *              *
        -
        -
        -
        -          *
     0.0+                         *         *                   *
        -                                        *
        -
        -     *
        -
    -1.2+
        -
        -                    *
        -                                                  *
          ----+---------+---------+---------+---------+---------+--Fit
             5.0       6.0       7.0       8.0       9.0      10.0
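Each point in the plot is just an observed Y value minus the value the
fitted line predicts for it. Here is a minimal sketch of that arithmetic.
It assumes that R's built-in "anscombe" data frame (columns x1 and y1)
holds the same numbers as the file anscombe.1; the names fitted_by_hand
and res_by_hand are made up for illustration.

```r
# Assumption: the built-in `anscombe` data frame stands in for the
# file "anscombe.1" (x1 -> X, y1 -> Y).
a1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
fit <- lm(Y ~ X, data = a1)

# Rebuild the fitted values from the two estimated coefficients
b <- coef(fit)
fitted_by_hand <- b[["(Intercept)"]] + b[["X"]] * a1$X

# Residual = observed Y minus fitted Y
res_by_hand <- a1$Y - fitted_by_hand

# Compare with what R computes, then look at the extremes
print(all.equal(unname(residuals(fit)), res_by_hand))
print(range(res_by_hand))
```

If the assumption about the data holds, the range printed at the end
should match the Min and Max entries of the Residuals line in the
summary report.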
Now you try it, with each of the four data sets anscombe.1, anscombe.2,
anscombe.3, and anscombe.4.
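One way to run all four fits in a single pass is sketched below. As an
assumption, it takes the data from R's built-in "anscombe" data frame
(columns x1..x4 and y1..y4) rather than from the four files; to use the
files instead, replace the data.frame() call with the read.table()
command shown earlier. The names fits and coefs are made up for
illustration.

```r
# Assumption: the built-in `anscombe` data frame stands in for the
# files anscombe.1 .. anscombe.4 (xi -> X, yi -> Y).
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste0("x", i)]],
                  Y = anscombe[[paste0("y", i)]])
  lm(Y ~ X, data = d)
})

# Intercept and slope, one column per data set
coefs <- sapply(fits, coef)
colnames(coefs) <- paste0("set", 1:4)
print(round(coefs, 2))

# Residual plots in a 2 x 2 grid (only drawn in an interactive session)
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    plot(fitted.values(fits[[i]]), residuals(fits[[i]]),
         xlab = "Fit", ylab = "Res", main = paste("Data set", i))
    abline(h = 0, lty = 2)  # reference line at zero residual
  }
  par(op)
}
```

Calling summary(fits[[2]]), summary(fits[[3]]), and so on gives the
full report for each data set.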
The remarkable thing is that the regression equation, standard errors and
t-ratios, analysis of variance table, etc. are practically IDENTICAL for
all four data sets (they agree to about two decimal places), but the
residual analysis reveals that the data sets are quite different. Why?