># This command is needed under UNIX but not under Windows:
># motif()
>
># If necessary, include directory where you put your copy of "anscombe.1"
> a1 <- read.table("anscombe.1",col.names=c("X","Y"));
># Scatter Plot:
> fit <- lm(Y~X, data=a1);
> plot(a1$X, a1$Y, xlab="X", ylab="Y", main="Anscombe 1st Data Set");
> abline(fit);
># Residual Plot:
> pre <- predict(fit);
> res <- residuals(fit);
> plot(pre, res, xlab="Predicted", ylab="Residuals",
> main="Anscombe 1: Residuals vs. Predicted");
> abline(h=0);
This generates a scatter plot of the points (X,Y) from the dataset, and a residual plot that reveals any pattern not captured by the linear model.
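If you do not have the "anscombe.1" file at hand, R ships the same eleven points in its built-in anscombe data frame (columns x1..x4 and y1..y4, one pair per data set). A sketch of the same two plots using that built-in copy:

```r
# Built-in copy of the "anscombe.1" data: columns x1 and y1
a1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
fit <- lm(Y ~ X, data = a1)

# Scatter plot with the fitted regression line:
plot(a1$X, a1$Y, xlab = "X", ylab = "Y", main = "Anscombe 1st Data Set")
abline(fit)

# Residual plot: residuals against predicted values
plot(predict(fit), residuals(fit),
     xlab = "Predicted", ylab = "Residuals",
     main = "Anscombe 1: Residuals vs. Predicted")
abline(h = 0)
```

The fitted coefficients agree with the transcript below (intercept about 3.00, slope about 0.50).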
The numerical linear regression results can be shown briefly by typing "print(fit)" or simply "fit":
> print(fit)
Call:
lm(formula = Y ~ X, data = a1)
Coefficients:
(Intercept) X
3.000090909090909 0.5000909090909091
Degrees of freedom: 11 total; 9 residual
Residual standard error: 1.236603322726321
>
Thus the regression equation is approximately Yi = 3.00 + 0.50 Xi + ei,
where the residual errors ei have mean zero and variance 1.2366^2.
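These two coefficients are just the usual least-squares formulas: the slope is the sum of cross-products of centered X and Y divided by the sum of squared centered X, and the intercept makes the line pass through the point of means. A quick check, using R's built-in anscombe copy of the data:

```r
# Built-in copy of the "anscombe.1" data
x <- anscombe$x1
y <- anscombe$y1

# Least-squares slope and intercept from first principles:
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)   # approximately 3.0001 and 0.5001
```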
More detail is available with "summary(fit)":
> summary(fit)
Call: lm(formula = Y ~ X, data = a1)
Residuals:
                Min                  1Q               Median                 3Q                Max
 -1.921272727272728 -0.4557727272727272 -0.04136363636363576 0.7094090909090909  1.838818181818182
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 3.00009090909090900 1.12474679080864400 2.66734782762436200 0.02573405139916262
X 0.50009090909090910 0.11790550059563410 4.24145528889283200 0.00216962887307881
Residual standard error: 1.236603322726321 on 9 degrees of freedom
Multiple R-Squared: 0.6665424595087749
F-statistic: 17.98994296767698 on 1 and 9 degrees of freedom, the p-value
is 0.00216962887307881
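For a simple one-predictor regression these summary statistics are tied together: Multiple R-Squared is just the squared correlation of X and Y, and the F-statistic is the square of the slope's t-value (17.99 = 4.2415^2 above). A check, again using the built-in anscombe copy of the data:

```r
# Built-in copy of the "anscombe.1" data
x <- anscombe$x1
y <- anscombe$y1
s <- summary(lm(y ~ x))

r2   <- cor(x, y)^2                      # equals Multiple R-Squared
tval <- s$coefficients["x", "t value"]   # t-value for the slope

c(R2 = r2, F = tval^2)                   # F-statistic equals t^2
```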
Correlation of Coefficients:
(Intercept)
X -0.9434563530497264
>
This gives a bit more information: we learn standard errors for the
regression coefficients, etc. But the real question of how well the
linear model fits is difficult or impossible to answer from these
tabulated numbers alone, as becomes clear when "anscombe.1" is replaced
in turn by each of the other three datasets. Now try it yourself, with
each of the four data sets: anscombe.1, anscombe.2, anscombe.3, and
anscombe.4.
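The comparison can be automated. A sketch that fits all four regressions from R's built-in anscombe data frame and prints the summary quantities that turn out to coincide:

```r
# Fit Y ~ X for each of the four Anscombe data sets (built-in copy)
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  s <- summary(lm(y ~ x))
  cat(sprintf("Set %d: intercept %.3f  slope %.3f  R^2 %.3f  sigma %.3f\n",
              i, s$coefficients[1, 1], s$coefficients[2, 1],
              s$r.squared, s$sigma))
}
```

All four lines of output are essentially the same, even though the residual plots are strikingly different.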
The remarkable thing is that the regression equation, the "Std. Error"
values and t-values, the F-statistic, etc. are IDENTICAL for all four data
sets, but the graphical residual analysis above reveals that the four data
sets are quite different. Why? Try it!