## STA114: Statistics

### Anscombe Data Sets

Anscombe offers four data sets:

- anscombe.1
- anscombe.2
- anscombe.3
- anscombe.4

Each has eleven observations of two variables we'll call X and Y. Consider the following R commands:

```
> a1 <- read.table("anscombe.1", header=TRUE);  # assumes a header row naming the columns X and Y
> fit <- lm(Y~X,data=a1);
> summary(fit);
> plot(fitted.values(fit), residuals(fit));
```
This reads the first Anscombe dataset into a new variable "a1", fits a linear model (that's what "lm" stands for) to regress Y on X, prints a summary of the fit, and plots the residuals against the fitted values. The summary report is:
```
Call:
lm(formula = Y ~ X, data = a1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
X             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,     Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170
```
indicating that the fitted regression line is approximately Y = 3.0 + 0.5 X. The residual plot is something like:
```
Res     -                                             *
        -
        -
     1.2+               *              *
        -
        -
        -
        -          *
     0.0+                         *         *                   *
        -                                        *
        -
        -     *
        -
    -1.2+
        -
        -                    *
        -                                                  *
          ----+---------+---------+---------+---------+---------+--Fit
             5.0       6.0       7.0       8.0       9.0      10.0
```
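The numbers in the summary and the points in the plot can be recomputed by hand, which shows what `fitted.values` and `residuals` actually return. The sketch below is an illustration, not part of the original commands: it assumes the file anscombe.1 holds the same values as columns `x1` and `y1` of R's built-in `anscombe` data frame, and builds a stand-in `a1` from them.

```r
# Assumption: anscombe.1 matches columns x1, y1 of R's built-in `anscombe` data.
a1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
fit <- lm(Y ~ X, data = a1)

b <- coef(fit)                           # intercept and slope
fitted_by_hand <- b[1] + b[2] * a1$X     # points on the fitted line
resid_by_hand  <- a1$Y - fitted_by_hand  # observed minus fitted

# These should agree with the extractor functions used in the plot command.
print(all.equal(unname(fitted_by_hand), unname(fitted.values(fit))))
print(all.equal(unname(resid_by_hand),  unname(residuals(fit))))

# R-squared is one minus (residual sum of squares / total sum of squares).
r2 <- 1 - sum(resid_by_hand^2) / sum((a1$Y - mean(a1$Y))^2)
print(round(r2, 4))
```

If the assumption about the data holds, the rounded R-squared should match the "Multiple R-squared" line of the summary report.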
Now try it yourself with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, standard errors and t-ratios, analysis of variance table, etc. are IDENTICAL (to two or three significant digits) for all four data sets, but the residual analysis reveals that they are quite different. Why?
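One way to run all four fits side by side is sketched below. It is only a sketch, and it does not answer the question. R ships a data frame called `anscombe` with columns `x1` to `x4` and `y1` to `y4`; the code assumes these match the four files anscombe.1 through anscombe.4.

```r
# Assumption: built-in `anscombe` columns x1..x4, y1..y4 match the four files.
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste0("x", i)]],
                  Y = anscombe[[paste0("y", i)]])
  lm(Y ~ X, data = d)
})

# Coefficient estimates, one column per data set.
coefs <- sapply(fits, coef)
print(round(coefs, 3))

# Residual plots in a 2-by-2 grid, one panel per data set.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted.values(fits[[i]]), residuals(fits[[i]]),
       xlab = "Fit", ylab = "Res", main = paste("Data set", i))
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(old_par)
```

Calling `summary(fits[[i]])` for each `i` gives the full report to compare against the one shown for the first data set.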