Data files: anscombe.1 | anscombe.2 | anscombe.3 | anscombe.4
># This command is needed under UNIX but not under Windows:
># motif()
>
># If necessary, include the directory where you put your copy of "anscombe.1":
> a1 <- read.table("anscombe.1", col.names=c("X","Y"))
>
># Scatter plot:
> fit <- lm(Y ~ X, data=a1)
> plot(a1$X, a1$Y, xlab="X", ylab="Y", main="Anscombe 1st Data Set")
> abline(fit)
>
># Residual plot:
> pre <- predict(fit)
> res <- residuals(fit)
> plot(pre, res, xlab="Predicted", ylab="Residuals",
+      main="Anscombe 1: Residuals vs. Predicted")
> abline(h=0)
This generates a scatter plot of the points (X, Y) from the data set, and a residual plot that reveals any pattern the linear model fails to capture.
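If you would rather see the two plots side by side in a single graphics window, one option (a small sketch, not part of the original recipe; par(mfrow=...) is the standard layout setting in R) is:

># Optional layout: one row, two columns (scatter plot left, residual plot right)
> par(mfrow=c(1,2))
> plot(a1$X, a1$Y, xlab="X", ylab="Y", main="Anscombe 1st Data Set")
> abline(fit)
> plot(pre, res, xlab="Predicted", ylab="Residuals", main="Anscombe 1: Residuals vs. Predicted")
> abline(h=0)
> par(mfrow=c(1,1))   # restore the default single-plot layout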
The numerical results of the linear regression can be shown briefly by typing "print(fit)" or simply "fit":
> print(fit)
Call:
lm(formula = Y ~ X, data = a1)

Coefficients:
      (Intercept)                   X 
3.000090909090909  0.5000909090909091 

Degrees of freedom: 11 total; 9 residual
Residual standard error: 1.236603322726321

Thus the regression equation is approximately Yi = 3.00 + 0.50 Xi + ei, where the residual errors ei have mean zero and variance (1.2366)^2. More detail is available with "summary(fit)":
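If you want to pull these numbers out of the fitted object programmatically rather than reading them off the printout, a minimal sketch (assuming R; coef() and summary(fit)$sigma are the standard accessors, and the displayed precision may differ on your system) is:

> coef(fit)           # intercept and slope of the fitted line
(Intercept)           X 
  3.0000909   0.5000909 
> summary(fit)$sigma  # residual standard error; square it for the residual variance
[1] 1.236603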
> summary(fit)
Call:
lm(formula = Y ~ X, data = a1)

Residuals:
                Min                   1Q                Median 
 -1.921272727272728  -0.4557727272727272  -0.04136363636363576 
                 3Q                  Max 
 0.7094090909090909    1.838818181818182 

Coefficients:
                           Value           Std. Error              t value             Pr(>|t|)
(Intercept)  3.00009090909090900  1.12474679080864400  2.66734782762436200  0.02573405139916262
          X  0.50009090909090910  0.11790550059563410  4.24145528889283200  0.00216962887307881

Residual standard error: 1.236603322726321 on 9 degrees of freedom
Multiple R-Squared: 0.6665424595087749
F-statistic: 17.98994296767698 on 1 and 9 degrees of freedom, the p-value is 0.00216962887307881

Correlation of Coefficients:
  (Intercept)
X -0.9434563530497264

This gives a bit more information: we learn the standard errors of the regression coefficients, and so on. But the real question of how well the linear model fits is difficult or impossible to glean from these tabulated numbers, as becomes clear when "anscombe.1" is replaced in turn by each of the other three data sets.

Now you try it, with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, the "Std. Error" values, the t-values, the F-statistic, etc. are IDENTICAL for all four data sets, yet the graphical residual analysis above reveals that they are quite different. Why? Try it!
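If you would rather not repeat the recipe by hand four times, one way to run the whole comparison in a single block (a sketch, not part of the original handout; it assumes R and that the four data files sit in the working directory) is:

># Fit and plot all four Anscombe data sets with identical code
> par(mfrow=c(4,2))    # one row of panels per data set: scatter plot, then residual plot
> for (k in 1:4) {
+   fname <- paste("anscombe.", k, sep="")
+   a <- read.table(fname, col.names=c("X","Y"))
+   fit <- lm(Y ~ X, data=a)
+   print(summary(fit))          # numerical results: essentially identical for all four
+   plot(a$X, a$Y, xlab="X", ylab="Y", main=paste("Anscombe Data Set", k))
+   abline(fit)
+   plot(predict(fit), residuals(fit), xlab="Predicted", ylab="Residuals",
+        main=paste("Anscombe ", k, ": Residuals vs. Predicted", sep=""))
+   abline(h=0)
+ }
> par(mfrow=c(1,1))    # restore the default layout

The numbers printed by summary() will agree for all four data sets (up to rounding), while the four residual plots will look strikingly different.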