STA113: Probability and Statistics in Engineering

Anscombe Data Sets

Anscombe offers four data sets:
anscombe.1
anscombe.2
anscombe.3
anscombe.4

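Each has eleven observations of two variables, which we'll call X and Y. Look at the following Minitab session, which reads the first data set, regresses Y on X, and plots the residuals against the fitted values: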


 MTB > read 'anscombe.1' into c1-c2
Entering data from file: anscombe.1
      11 rows read.
 MTB > name c1 'X' c2 'Y'
 MTB > regress c2 1 c1; 
 SUBC> fits = c3;
 SUBC> resid = c4.

 The regression equation is
 Y = 3.00 + 0.500 X
 
 Predictor       Coef       Stdev    t-ratio        p
 Constant       3.000       1.125       2.67    0.026
 X             0.5001      0.1179       4.24    0.002
 
 s = 1.237       R-sq = 66.7%     R-sq(adj) = 62.9%
 
 Analysis of Variance
 
 SOURCE       DF          SS          MS         F        p
 Regression    1      27.510      27.510     17.99    0.002
 Error         9      13.763       1.529
 Total        10      41.273
 
 MTB > name c3 'Fit' c4 'Res'
 MTB > plot c4 c3
 
  Res     -                                             *
          -
          -
       1.2+               *              *
          -
          -
          -
          -          *
       0.0+                         *         *                   *
          -                                        *
          -
          -     *
          -
      -1.2+
          -
          -                    *
          -                                                  *
            ----+---------+---------+---------+---------+---------+--Fit     
              5.0       6.0       7.0       8.0       9.0      10.0
 
Now try it yourself with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, the Stdev values and t-ratios, the Analysis of Variance table, and so on are essentially IDENTICAL for all four data sets (they agree to about two or three significant digits), but the residual plots reveal that the data sets are quite different. Why?
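
As a cross-check on the Minitab session for anscombe.1, the reported quantities can be recomputed directly from the usual least-squares formulas. The Python sketch below is only an illustration and is not part of the Minitab exercise. It assumes that anscombe.1 contains the first data set as published by Anscombe (1973), typed in by hand here; the function name simple_regression and the dictionary keys are made up for this example.

```python
import math

# Assumption: these are the values of Anscombe's first published data set,
# taken here to be the contents of the file anscombe.1 (X in c1, Y in c2).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x with the usual summary statistics."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n

    # Corrected sums of squares and cross-products.
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)      # Total SS, n - 1 DF

    slope = sxy / sxx
    intercept = ybar - slope * xbar

    # Fitted values and residuals (what the session stores in c3 and c4).
    fits = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fits)]

    sse = sum(r * r for r in resid)              # Error SS, n - 2 DF
    ssr = sst - sse                              # Regression SS, 1 DF
    mse = sse / (n - 2)
    s = math.sqrt(mse)

    # Standard errors of the coefficients (the "Stdev" column).
    se_slope = s / math.sqrt(sxx)
    se_intercept = s * math.sqrt(1.0 / n + xbar ** 2 / sxx)

    return {
        "intercept": intercept, "slope": slope,
        "se_intercept": se_intercept, "se_slope": se_slope,
        "t_intercept": intercept / se_intercept, "t_slope": slope / se_slope,
        "s": s, "r_sq": ssr / sst,
        "ssr": ssr, "sse": sse, "sst": sst, "f": ssr / mse,
        "fits": fits, "resid": resid,
    }


if __name__ == "__main__":
    out = simple_regression(X, Y)
    print("Y = %.2f + %.3f X" % (out["intercept"], out["slope"]))
    print("s = %.3f   R-sq = %.1f%%" % (out["s"], 100 * out["r_sq"]))
    print("F = %.2f" % out["f"])
    # Print (fit, residual) pairs: the points shown in the residual plot.
    for fit, res in zip(out["fits"], out["resid"]):
        print("%6.2f  %6.2f" % (fit, res))
```

Under the stated assumption about the file contents, these values should agree with the Minitab output above to the precision Minitab prints. Replacing X and Y with the values from the other three files would allow the same comparison of summary statistics and residuals for each data set.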