Simulated Data

To simulate a linear regression dataset, we generate the explanatory variable by drawing 20 points uniformly at random between 0 and 5. We then simulate the response variable through the equation \(y_{i} = 1 + 3x_{i} + \epsilon_{i}\), where \(\epsilon_{i} \sim N(0,1)\) is the noise term. Thus, the true parameters for our regression equation are \(\beta_{0} = 1\) and \(\beta_{1} = 3\).

set.seed(21)                      # for reproducibility
n <- 20
x <- runif(n, min = 0, max = 5)   # explanatory variable: uniform draws on [0, 5]
y <- 1 + 3*x + rnorm(n, 0, 1)     # response: true line plus standard normal noise
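
As a quick supplementary check (a sketch, not part of the original output), we can plot the simulated points against the true line \(y = 1 + 3x\):

plot(x, y, pch = 19)            # simulated data
abline(a = 1, b = 3, lty = 2)   # true line: intercept 1, slope 3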

Least Squares Regression Model

model <- lm(y ~ x)  # fit the least squares line
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01397 -0.54349  0.01451  0.62676  1.83458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0927     0.3993   2.736   0.0136 *  
## x             2.9096     0.1309  22.231 1.54e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9482 on 18 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.9629 
## F-statistic: 494.2 on 1 and 18 DF,  p-value: 1.54e-14
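
As a supplementary check (output not shown), we can confirm that the fit is consistent with the true parameters by extracting the estimates and their 95% confidence intervals:

coef(model)     # point estimates of beta0 and beta1
confint(model)  # 95% CIs; these should cover the true values 1 and 3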

Non-Leverage Outlier

We now add the non-leverage outlier, \((2.5,18)\), to our data and re-fit the model. The point is an outlier because its response is far above the linear trend, but it has low leverage because its \(x\)-value, 2.5, lies in the middle of the observed range.

y.out <- c(y, 18)    # response far above the trend
x.out <- c(x, 2.5)   # x-value inside the observed range (low leverage)
model.out <- lm(y.out ~ x.out)
summary(model.out)
## 
## Call:
## lm(formula = y.out ~ x.out)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4525 -0.9835 -0.3203  0.1754  9.1735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5901     0.9801   1.622    0.121    
## x.out         2.8946     0.3238   8.940  3.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.346 on 19 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.7978 
## F-statistic: 79.93 on 1 and 19 DF,  p-value: 3.097e-08

Notice that the outlier has pulled our estimate of \(\beta_{0}\) upward (from about 1.09 to 1.59) and more than doubled the residual standard error, but it has barely changed our estimate of \(\beta_{1}\) (2.89 versus 2.91 originally).
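
A standard way to flag this kind of outlier is through studentized residuals, where magnitudes beyond about 2 or 3 are suspect. As a sketch (output not shown), the added point is observation 21:

rstudent(model.out)[21]               # studentized residual of the added point
which.max(abs(rstudent(model.out)))   # index of the most extreme residual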

Leverage Point (Non-Influential)

We now consider our original data with an added leverage point, \((8,25)\). This point has high leverage because it is far away from our original data horizontally: its \(x\)-value, 8, lies well outside the observed range of 0 to 5.

y.lev <- c(y, 25)   # response consistent with the true line (1 + 3*8 = 25)
x.lev <- c(x, 8)    # x-value far outside the observed range (high leverage)
model.lev <- lm(y.lev ~ x.lev)
summary(model.lev)
## 
## Call:
## lm(formula = y.lev ~ x.lev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0880 -0.5768  0.2110  0.6768  1.7337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0077     0.3580   2.815   0.0111 *  
## x.lev         2.9500     0.1037  28.446   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9299 on 19 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.9759 
## F-statistic: 809.2 on 1 and 19 DF,  p-value: < 2.2e-16

The leverage point hasn’t affected our estimate of the slope because it follows the linear trend of the original data. Thus, the point is not considered to be influential.
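
Leverage itself can be quantified with hat values. As a sketch (output not shown), the added point should stand out well above the average hat value of \((p+1)/n = 2/21\); one common rule of thumb flags points above twice the average:

h <- hatvalues(model.lev)
h[21]                # leverage of the added point (8, 25)
h[21] > 2 * mean(h)  # common rule of thumb for high leverage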

Influential Point

Finally, consider our original data with a different leverage point, \((8,5)\), which has the same extreme \(x\)-value but sits far below the linear trend.

y.inf <- c(y, 5)   # response far below the trend (the true line predicts 25 at x = 8)
x.inf <- c(x, 8)   # same high-leverage x-value as before
model.inf <- lm(y.inf ~ x.inf)
summary(model.inf)
## 
## Call:
## lm(formula = y.inf ~ x.inf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0416  -1.3650   0.1546   2.1385   4.9306 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7023     1.3947   2.655 0.015645 *  
## x.inf         1.6674     0.4041   4.127 0.000574 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.623 on 19 degrees of freedom
## Multiple R-squared:  0.4727, Adjusted R-squared:  0.4449 
## F-statistic: 17.03 on 1 and 19 DF,  p-value: 0.0005736

The added point has substantially changed our estimate of the slope (from about 2.91 in the original model to 1.67), and therefore it is considered to be influential.
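
Influence combines outlyingness with leverage, and Cook’s distance is the usual summary of it. As a closing sketch (output not shown), the added point should dominate here; values near 1, or well above the rough \(4/n\) cutoff, are commonly flagged:

d <- cooks.distance(model.inf)
d[21]                    # Cook's distance of the added point (8, 5)
which(d > 4/length(d))   # points exceeding the rough 4/n cutoff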