To simulate a linear regression dataset, we generate the explanatory variable by randomly choosing 20 points uniformly between 0 and 5. We then simulate the response variable through the equation \(y_{i} = 1 + 3x_{i} + \epsilon_i\), where \(\epsilon_{i} \sim N(0,1)\) is the noise term. Thus, the true parameters of our regression equation are \(\beta_{0} = 1\) and \(\beta_{1} = 3\).
set.seed(21)
n <- 20
x <- runif(n, min = 0, max = 5)
y <- 1 + 3 * x + rnorm(n, 0, 1)
model <- lm(y ~ x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01397 -0.54349  0.01451  0.62676  1.83458 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0927     0.3993   2.736   0.0136 *  
## x             2.9096     0.1309  22.231 1.54e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9482 on 18 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.9629
## F-statistic: 494.2 on 1 and 18 DF, p-value: 1.54e-14
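As a quick check that the fit recovers the true parameters, we can pull out the point estimates and their confidence intervals with the built-in `coef()` and `confint()` functions. This sketch re-creates the simulated data from above (assuming the same seed, 21) so that it runs on its own:

```r
# Re-create the simulated data from above (same seed)
set.seed(21)
n <- 20
x <- runif(n, min = 0, max = 5)
y <- 1 + 3 * x + rnorm(n, 0, 1)
model <- lm(y ~ x)

coef(model)     # point estimates; close to the true beta0 = 1, beta1 = 3
confint(model)  # 95% intervals; both should cover the true parameters
```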
We now add a non-leverage outlier, \((2.5, 18)\), to our data and re-fit the model. Its \(x\)-value, 2.5, sits in the middle of the original data, so the point has low leverage even though its \(y\)-value is far from the trend.
y.out <- c(y, 18)
x.out <- c(x, 2.5)
model.out <- lm(y.out ~ x.out)
summary(model.out)
##
## Call:
## lm(formula = y.out ~ x.out)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4525 -0.9835 -0.3203  0.1754  9.1735 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5901     0.9801   1.622    0.121    
## x.out         2.8946     0.3238   8.940  3.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.346 on 19 degrees of freedom
## Multiple R-squared: 0.8079, Adjusted R-squared: 0.7978
## F-statistic: 79.93 on 1 and 19 DF, p-value: 3.097e-08
Notice that the outlier has pulled our estimate of \(\beta_{0}\) away from the original fit and inflated the residual standard error, but it hasn't changed our estimate of \(\beta_{1}\) by much.
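We can also detect the outlier directly with standardized residuals, using R's built-in `rstandard()`. The sketch below re-creates the data and model from above (assuming the same seed, 21); a common rule of thumb flags observations whose standardized residual exceeds 2 in absolute value:

```r
# Re-create the data and the outlier model from above (same seed)
set.seed(21)
x <- runif(20, min = 0, max = 5)
y <- 1 + 3 * x + rnorm(20, 0, 1)
x.out <- c(x, 2.5)
y.out <- c(y, 18)
model.out <- lm(y.out ~ x.out)

rs <- rstandard(model.out)  # internally studentized residuals
which(abs(rs) > 2)          # the added point (observation 21) should stand out
```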
We now consider our original data with the added leverage point, \((8, 25)\). This point has high leverage because it's far away from our original data horizontally.
y.lev <- c(y, 25)
x.lev <- c(x, 8)
model.lev <- lm(y.lev ~ x.lev)
summary(model.lev)
##
## Call:
## lm(formula = y.lev ~ x.lev)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0880 -0.5768  0.2110  0.6768  1.7337 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0077     0.3580   2.815   0.0111 *  
## x.lev         2.9500     0.1037  28.446   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9299 on 19 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.9759
## F-statistic: 809.2 on 1 and 19 DF, p-value: < 2.2e-16
The leverage point hasn't affected our estimate of the slope because it follows the linear trend of the original data. Thus, the point is not considered to be influential.
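Leverage can be quantified with the hat values returned by R's built-in `hatvalues()`. The hat values always sum to the number of parameters \(p\), so their average is \(p/n = 2/21\) here; a common rule of thumb calls a point high-leverage when its hat value exceeds twice that average. This sketch re-creates the data and model from above (assuming the same seed, 21):

```r
# Re-create the data and the leverage-point model from above (same seed)
set.seed(21)
x <- runif(20, min = 0, max = 5)
y <- 1 + 3 * x + rnorm(20, 0, 1)
x.lev <- c(x, 8)
y.lev <- c(y, 25)
model.lev <- lm(y.lev ~ x.lev)

h <- hatvalues(model.lev)
h[21]                # leverage of the added point (8, 25)
mean(h)              # average leverage is p/n = 2/21 in simple regression
which(h > 2 * mean(h))  # rule of thumb: hat value above twice the average
```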
Finally, consider our original data with a different leverage point, \((8,5)\).
y.inf <- c(y, 5)
x.inf <- c(x, 8)
model.inf <- lm(y.inf ~ x.inf)
summary(model.inf)
##
## Call:
## lm(formula = y.inf ~ x.inf)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0416  -1.3650   0.1546   2.1385   4.9306 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7023     1.3947   2.655 0.015645 *  
## x.inf         1.6674     0.4041   4.127 0.000574 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.623 on 19 degrees of freedom
## Multiple R-squared: 0.4727, Adjusted R-squared: 0.4449
## F-statistic: 17.03 on 1 and 19 DF, p-value: 0.0005736
The added point has substantially changed our estimate of the slope relative to the original model (from roughly 2.9 down to 1.7), and therefore it's considered to be influential: it combines high leverage with a large residual.
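Influence can be measured directly with Cook's distance, via R's built-in `cooks.distance()`, which combines a point's residual and its leverage into a single number. A common (conservative) cutoff treats \(D_i > 1\) as influential. This sketch re-creates the data and model from above (assuming the same seed, 21):

```r
# Re-create the data and the influential-point model from above (same seed)
set.seed(21)
x <- runif(20, min = 0, max = 5)
y <- 1 + 3 * x + rnorm(20, 0, 1)
x.inf <- c(x, 8)
y.inf <- c(y, 5)
model.inf <- lm(y.inf ~ x.inf)

d <- cooks.distance(model.inf)
d[21]         # Cook's distance of the added point (8, 5)
which(d > 1)  # D > 1 is a common cutoff for an influential observation
```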