STA114/MTH136 Anscombe Data

STA114/MTH136: Statistics

Anscombe Data Sets

Anscombe offers four data sets (anscombe.1, anscombe.2, anscombe.3, and anscombe.4), each with eleven observations of two variables we'll call X and Y. Look at the following S-Plus session:


># This command is needed under UNIX but not under Windows:
># motif()
>
># If necessary, include directory where you put your copy of "anscombe.1"
>  a1 <- read.table("anscombe.1",col.names=c("X","Y"));
># Scatter Plot:
>  fit <- lm(Y~X, data=a1);
>  plot(a1$X, a1$Y, xlab="X", ylab="Y", main="Anscombe 1st Data Set");
>  abline(fit);
># Residual Plot:
>  pre <- predict(fit);
>  res <- residuals(fit);
>  plot(pre, res, xlab="Predicted", ylab="Residuals",
>       main="Anscombe 1: Residuals vs. Predicted");
>  abline(h=0);

This generates a scatter plot of the points (X,Y) from the data set, with the fitted regression line drawn through them, and a "residual plot" revealing any pattern that is not captured by the linear model.
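
A residual is simply the observed Y minus the value predicted by the fitted line. The sketch below checks this in R rather than S-Plus, on the assumption that R's built-in `anscombe` data frame (columns `x1` and `y1`) holds the same values as the file anscombe.1. The names `gap` and `total` are introduced here for illustration only.

```r
# Assumption: R's built-in anscombe$x1 and anscombe$y1 match the file "anscombe.1".
a1  <- data.frame(X = anscombe$x1, Y = anscombe$y1)
fit <- lm(Y ~ X, data = a1)
pre <- predict(fit)     # fitted values on the regression line
res <- residuals(fit)   # the part of Y the line does not explain

# Largest discrepancy between residuals(fit) and Y - predicted
# (should be rounding error only).
gap <- max(abs(res - (a1$Y - pre)))

# With an intercept in the model, least-squares residuals sum to zero.
total <- sum(res)

print(c(gap = gap, total = total))
```

Both printed values should be zero up to rounding error, which is why the residual plot is centered on the horizontal line drawn by abline(h=0).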

The numerical linear regression results can be shown briefly by typing "print(fit)" or simply "fit":


> print(fit)
Call:
lm(formula = Y ~ X, data = a1)

Coefficients:
       (Intercept)                  X 
 3.000090909090909 0.5000909090909091

Degrees of freedom: 11 total; 9 residual
Residual standard error: 1.236603322726321 
>
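
The numbers printed here are the ordinary least-squares quantities. The notation below is introduced for this note and does not appear in the S-Plus output: a-hat and b-hat denote the fitted intercept and slope, s the residual standard error, and n = 11 the number of observations.

```latex
\begin{aligned}
\hat b &= \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}
               {\sum_{i=1}^{n} (X_i - \bar X)^2}, \\
\hat a &= \bar Y - \hat b\,\bar X, \\
s^2    &= \frac{1}{n-2} \sum_{i=1}^{n} \bigl(Y_i - \hat a - \hat b X_i\bigr)^2 .
\end{aligned}
```

The divisor n - 2 = 9 is the "9 residual" degrees of freedom reported in the output.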

Thus the regression equation is approximately Y_i = 3.00 + 0.50 X_i + e_i, where the residual errors e_i have mean zero and estimated variance 1.2366². More detail is available with "summary(fit)":


> summary(fit)

Call: lm(formula = Y ~ X, data = a1)
Residuals:
                Min                  1Q               Median 
 -1.921272727272728 -0.4557727272727272 -0.04136363636363576
                 3Q               Max 
 0.7094090909090909 1.838818181818182

Coefficients:
                          Value          Std. Error             t value            Pr(>|t|)
(Intercept) 3.00009090909090900 1.12474679080864400 2.66734782762436200 0.02573405139916262
          X 0.50009090909090910 0.11790550059563410 4.24145528889283200 0.00216962887307881

Residual standard error: 1.236603322726321 on 9 degrees of freedom
Multiple R-Squared: 0.6665424595087749 
F-statistic: 17.98994296767698 on 1 and 9 degrees of freedom, the p-value
is  0.00216962887307881 

Correlation of Coefficients:
          (Intercept) 
X -0.9434563530497264
>

This gives a bit more information: we learn standard errors for the regression coefficients, etc. But the real question of how well the linear model fits is difficult or impossible to glean from these tabulated numbers, as becomes clear when "anscombe.1" is replaced in turn by each of the other three data sets.

Now you try it, with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, "Std. Error" values and t-values, F-statistic, etc. are virtually IDENTICAL for all four data sets (they agree to two or three significant figures), but the graphical residual analysis above reveals that they are quite different. Why? Try it!
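
One way to run the comparison in a single pass is sketched below. It is written for R rather than S-Plus and assumes that R's built-in `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`) contains the same values as the files anscombe.1 through anscombe.4. The names `fits` and `stats` are illustrative and not part of the original session.

```r
# Assumption: R's built-in `anscombe` data frame stands in for the files
# anscombe.1 ... anscombe.4 used in the session above.
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste("x", i, sep = "")]],
                  Y = anscombe[[paste("y", i, sep = "")]])
  lm(Y ~ X, data = d)
})

# One row per data set: intercept, slope, residual standard error, R-squared.
stats <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    sigma     = s$sigma,
    r.squared = s$r.squared)
}))
print(round(stats, 2))

# Residuals vs. predicted values for all four fits on one page.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(predict(fits[[i]]), residuals(fits[[i]]),
       xlab = "Predicted", ylab = "Residuals",
       main = paste("Anscombe", i, ": Residuals vs. Predicted"))
  abline(h = 0)
}
```

The rows of the printed table should be nearly indistinguishable, while the four residual plots look nothing alike. To work from your own copies of the files instead, replace the data.frame(...) call with read.table(paste("anscombe.", i, sep=""), col.names=c("X","Y")) as in the session above.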