each with eleven observations of two variables we'll call X and Y. Look at the following R commands:

> a1 <- read.table("anscombe.1",col.names=c("X","Y"))
> fit <- lm(Y~X,data=a1)
> summary(fit)
> plot(fitted.values(fit), residuals(fit))

This reads the first Anscombe dataset into a new variable "a1", then fits a linear model (that's what "lm" stands for) to regress Y on X. The summary report is:

Call:
lm(formula = Y ~ X, data = a1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
X             0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665,  Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170

indicating that the regression fit is Y = 3.0 + 0.5 X. The residual plot is something like:

[Character plot: residuals (Res, vertical axis, ticks at -1.2, 0.0, 1.2) against fitted values (Fit, horizontal axis, 5.0 to 10.0), eleven points.]

Now you try it, with each of the four data sets anscombe.1, anscombe.2, anscombe.3, and anscombe.4. The remarkable thing is that the regression equation, standard errors and t-ratios, analysis of variance table, etc. are IDENTICAL (to about two decimal places) for all four data sets, but the residual analysis reveals that they are quite different. Why?
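
If you would rather run all four fits in one go, here is a minimal sketch. It assumes that R's built-in "anscombe" data frame (columns x1..x4 and y1..y4, from the datasets package) holds the same numbers as the files anscombe.1 through anscombe.4; if your files differ, swap in read.table as above. The names "fits" and "stats" are made up for this illustration.

```r
# ASSUMPTION: the built-in 'anscombe' data frame matches the four files.
data(anscombe)

# Fit Y ~ X separately for each of the four data sets.
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste0("x", i)]],
                  Y = anscombe[[paste0("y", i)]])
  lm(Y ~ X, data = d)
})

# Collect the headline numbers: one row per data set.
stats <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    sigma     = s$sigma,        # residual standard error
    r.squared = s$r.squared)
}))
print(round(stats, 3))

# Residuals against fitted values, one panel per data set.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted.values(fits[[i]]), residuals(fits[[i]]),
       xlab = "Fit", ylab = "Res", main = paste("anscombe", i, sep = "."))
  abline(h = 0, lty = 2)        # reference line at zero
}
par(op)
```

The rows of the printed table should look nearly alike, while the four residual panels should not. Putting them side by side makes the comparison easier than running the commands four times.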