STA114 Anscombe Data

STA114: Statistics

Anscombe Data Sets

Anscombe offers four data sets:

    anscombe.1
    anscombe.2
    anscombe.3
    anscombe.4

each with eleven observations of two variables we'll call X and Y. Look at the following R commands:


 > a1 <- read.table("anscombe.1", col.names = c("X", "Y"))
 > fit <- lm(Y ~ X, data = a1)
 > summary(fit)
 > plot(fitted.values(fit), residuals(fit))

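The first command reads the first Anscombe data set into a new data frame "a1". The second fits a linear model (that's what "lm" stands for) to regress Y on X. The last two print a summary of the fit and plot the residuals against the fitted values. The summary report is: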

 
Call:
lm(formula = Y ~ X, data = a1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
X             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,     Adjusted R-squared: 0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170

indicating that the fitted regression line is approximately Y = 3.0 + 0.5 X. The plot of residuals against fitted values looks something like:

 
  Res     -                                             *
          -
          -
       1.2+               *              *
          -
          -
          -
          -          *
       0.0+                         *         *                   *
          -                                        *
          -
          -     *
          -
      -1.2+
          -
          -                    *
          -                                                  *
            ----+---------+---------+---------+---------+---------+--Fit     
              5.0       6.0       7.0       8.0       9.0      10.0
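
If the file anscombe.1 is not at hand, the fit can probably be reproduced from the "anscombe" data frame that ships with R's datasets package. This is only a sketch under one assumption: that columns x1 and y1 of that data frame hold the same values as anscombe.1. The names a1b and fit1 are made up for this example.

```r
# Assumption: columns x1, y1 of R's built-in `anscombe` data frame
# hold the same values as the file anscombe.1.
a1b <- data.frame(X = anscombe$x1, Y = anscombe$y1)
fit1 <- lm(Y ~ X, data = a1b)

# Pull the estimates out of the fit instead of reading them off the report.
print(round(coef(fit1), 4))
print(round(summary(fit1)$r.squared, 4))

# Residuals against fitted values, with a dashed reference line at zero.
plot(fitted.values(fit1), residuals(fit1), xlab = "Fit", ylab = "Res")
abline(h = 0, lty = 2)
```

The reference line at zero makes it easier to judge whether the residuals scatter evenly above and below the fit.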

Now try it yourself with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, standard errors, t values, Analysis of Variance table, etc. are virtually IDENTICAL for all four data sets (they agree to about two decimal places), but the residual plots reveal that the data sets are quite different. Why?
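
One way to run all four fits in a single pass is sketched below. It assumes that R's built-in "anscombe" data frame (columns x1 to x4 and y1 to y4) can stand in for the four files; with the files themselves, a read.table call would replace the data.frame line. The names fits and coefs are illustrative only.

```r
# Assumption: R's built-in `anscombe` data frame stands in for the
# files anscombe.1 ... anscombe.4 (columns x1..x4, y1..y4).
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste0("x", i)]],
                  Y = anscombe[[paste0("y", i)]])
  lm(Y ~ X, data = d)
})

# One row per data set: intercept, slope, and R-squared.
coefs <- t(sapply(fits, function(f) c(coef(f), R2 = summary(f)$r.squared)))
print(round(coefs, 3))

# Four residual plots in a 2 x 2 grid for side-by-side comparison.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted.values(fits[[i]]), residuals(fits[[i]]),
       xlab = "Fit", ylab = "Res", main = paste("anscombe", i, sep = "."))
  abline(h = 0, lty = 2)
}
par(op)
```

The table of coefficients shows how closely the four fits agree; the four plots are where to look when answering the question above.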