Frank Anscombe's Regression Examples
The intimate relationship between correlation and regression raises the
question of whether a regression analysis can be misleading in the same sense
as the earlier set of scatterplots, all of which had a correlation coefficient
of 0.70. In 1973, Frank Anscombe published a set of examples showing the
answer is a definite yes (Anscombe FJ (1973), "Graphs in Statistical
Analysis," The American Statistician, 27, 17-21). Anscombe's four datasets
share not only the same correlation coefficient, but also the same values for
the other summary statistics that are usually calculated alongside a simple
linear regression.
| Statistic | Value |
| --- | --- |
| n | 11 |
| x̄ (mean of x) | 9.0 |
| ȳ (mean of y) | 7.5 |
| Regression equation of y on x | y = 3 + 0.5x |
| Σ(x − x̄)² | 110.0 |
| Regression SS | 27.5 |
| Residual SS | 13.75 (9 df) |
| Estimated SE of b1 | 0.118 |
| r | 0.816 |
| R² | 0.667 |
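To see where these numbers come from, here is a minimal sketch computing each statistic from the first dataset (its values appear in the table at the end of this section). Python with NumPy is an assumption on my part; Anscombe's article shows no code.

```python
import numpy as np

# Anscombe's first dataset (the full quartet is tabulated at the end of this section).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

n = len(x)                                  # 11
sxx = np.sum((x - x.mean()) ** 2)           # 110.0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()               # fitted line: y = 3 + 0.5 x
yhat = b0 + b1 * x
reg_ss = np.sum((yhat - y.mean()) ** 2)     # ~27.5
res_ss = np.sum((y - yhat) ** 2)            # ~13.75 on n - 2 = 9 df
se_b1 = np.sqrt(res_ss / (n - 2) / sxx)     # ~0.118
r = np.corrcoef(x, y)[0, 1]                 # ~0.816; r**2 ~ 0.667
print(x.mean(), y.mean(), b0, b1, reg_ss, res_ss, se_b1, r, r ** 2)
```

Running the same calculation on any of the other three datasets reproduces the same table of statistics to the precision reported.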
Figure 1 is the picture drawn
by the mind's eye when a simple linear regression equation is reported. Yet, the
same summary statistics apply to figure 2, which shows a perfect curvilinear
relation, and to figure 3, which shows a perfect linear relation except for a
single outlier.
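Both claims are easy to verify numerically. A sketch under the same assumptions as above: a quadratic reproduces the second dataset to within the rounding of the published values, and removing the single outlier from the third dataset leaves an almost exactly linear relation.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Figure 2: a quadratic in x reproduces y2 almost exactly.
coef = np.polyfit(x, y2, 2)
print(np.max(np.abs(np.polyval(coef, x) - y2)))  # ~0.003, i.e. rounding error only

# Figure 3: drop the outlier at x = 13 and y3 is almost exactly linear in x.
keep = x != 13
b1, b0 = np.polyfit(x[keep], y3[keep], 1)
print(b0, b1)  # roughly y = 4 + 0.345 x, with near-zero residuals
```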
The summary statistics also apply to figure 4, which is the most troublesome case. Figures 2 and 3 clearly call the straight-line relation into question; figure 4 does not. A straight line may well be appropriate in the fourth case. However, the regression equation is determined entirely by the single observation at x = 19. Paraphrasing Anscombe, we need to know both the relation between y and x and the special contribution of the observation at x = 19 to that relation.
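Anscombe's point about the fourth dataset can be made precise with the leverage (hat) values, which for a simple linear regression are h_i = 1/n + (x_i − x̄)² / Σ(x − x̄)². A sketch under the same assumptions: the observation at x = 19 has leverage exactly 1, so the fitted line must pass through it, and with that point removed the slope is undefined because every remaining x equals 8.

```python
import numpy as np

# x values of Anscombe's fourth dataset, in the order of the table below.
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
n = len(x4)
sxx = np.sum((x4 - x4.mean()) ** 2)        # 110.0, same as the other datasets
h = 1 / n + (x4 - x4.mean()) ** 2 / sxx    # leverage of each observation
print(h)  # 0.1 at every x = 8; exactly 1.0 at x = 19
```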
Anscombe's four datasets are tabulated below. The first column of x values is shared by y1, y2, and y3; the fourth dataset has its own x values (x4).

| x | y1 | y2 | y3 | x4 | y4 |
| --- | --- | --- | --- | --- | --- |
| 10 | 8.04 | 9.14 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8.14 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 8.74 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 8.77 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 9.26 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 8.10 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6.13 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 3.10 | 5.39 | 19 | 12.50 |
| 12 | 10.84 | 9.13 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7.26 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 4.74 | 5.73 | 8 | 6.89 |
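As a check on the table, a sketch (same assumptions as above) fitting all four datasets; each produces the same line and correlation to the precision reported.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
datasets = {
    "y1": (x, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "y2": (x, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "y3": (x, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "y4": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}
for name, (xs, y) in datasets.items():
    b1, b0 = np.polyfit(xs, y, 1)          # least-squares slope and intercept
    r = np.corrcoef(xs, y)[0, 1]
    print(f"{name}: y = {b0:.2f} + {b1:.3f} x, r = {r:.3f}")
# Each line prints y = 3.00 + 0.500 x, r = 0.816 (to rounding),
# which is the whole point: the plots differ, the statistics do not.
```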