class: center, middle, inverse, title-slide

.title[
# Assumptions (kind of)
]
.author[
### Yue Jiang
]
.date[
### STA 210 / Duke University / Spring 2024
]

---

### Outliers and leverage

.vocab[Outliers] are points that don't follow the general pattern of the rest of the data

Points are said to have high .vocab[leverage] when they are extreme in some sense (e.g., unusual predictor values)

.vocab[Influential] points are those that disproportionately influence the results of a regression fit (e.g., the slope estimates)

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

.vocab[Cook's distance] is an estimate of how influential each observation is in a linear regression model. It measures how much all of the fitted values change when the `\(i^{th}\)` observation is removed: larger Cook's d implies larger influence. (A Cook's d greater than 0.5 or so is a common rule of thumb for flagging a potentially influential point.)

```r
library(car)
plot(cooks.distance(lm(y ~ x)),
     xlab = "Observation Index",
     ylab = "Cook's distance for regression model")
```

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
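
---

### Cook's distance

Beyond the plot, you can pull out the flagged observations directly. The chunk below is a sketch rather than part of the original lecture code; it assumes the same toy `y` and `x` used for the plots above.

```r
# Sketch: flag observations whose Cook's distance exceeds the ~0.5 rule of thumb
# (assumes the same toy y and x used on the previous slides)
m <- lm(y ~ x)
d <- cooks.distance(m)
which(d > 0.5)                    # indices of potentially influential observations
sort(d, decreasing = TRUE)[1:3]   # the three largest Cook's distances
```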

---

### Remember "augment"?

```r
library(tidymodels)
augment(lm(y ~ x)) # your model object goes here
```

```
## # A tibble: 20 × 8
##        y     x .fitted   .resid   .hat .sigma  .cooksd .std.resid
##    <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
##  1  14.8  3.86    13.8   0.986  0.0871  0.946 0.0562        1.09
##  2  18.5  5.36    17.7   0.780  0.0802  0.958 0.0319        0.855
##  3  15.1  4.23    14.8   0.322  0.0608  0.975 0.00396       0.350
##  4  19.1  5.65    18.4   0.655  0.109   0.964 0.0327        0.730
##  5  18.9  5.82    18.8   0.0645 0.131   0.978 0.000400      0.0728
##  6  13.2  3.14    12.0   1.23   0.187   0.921 0.236         1.44
##  7  16.3  4.58    15.7   0.576  0.0503  0.968 0.0102        0.622
##  8  17.1  5.68    18.5  -1.41   0.113   0.909 0.157        -1.57
##  9  16.7  4.65    15.9   0.810  0.0500  0.957 0.0201        0.874
## 10  14.6  4.37    15.1  -0.489  0.0548  0.971 0.00809      -0.529
## 11  18.5  5.87    19.0  -0.426  0.138   0.972 0.0187       -0.483
## 12  14.9  4.36    15.1  -0.238  0.0551  0.977 0.00194      -0.258
## 13  16.1  5.03    16.8  -0.751  0.0586  0.960 0.0206       -0.814
## 14  15.4  4.72    16.0  -0.592  0.0503  0.967 0.0108       -0.639
## 15  11.3  3.31    12.4  -1.11   0.157   0.934 0.150        -1.27
## 16  17.4  5.70    18.5  -1.12   0.115   0.935 0.102        -1.25
## 17  14.1  3.74    13.5   0.545  0.0997  0.968 0.0202        0.604
## 18  11.5  3.13    11.9  -0.408  0.189   0.972 0.0263       -0.476
## 19  12.8  3.98    14.1  -1.32   0.0766  0.920 0.0870       -1.45
## 20  20.8  5.86    19.0   1.89   0.137   0.844 0.365         2.14
```

---

### Another diagnostic plotting function

```r
library(ggfortify)
autoplot(lm(y ~ x)) # your model object goes here
```

---

### Another diagnostic plotting function

<img src="assumptions_2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

### What to do with outliers?

We can often detect outliers visually (e.g., in the residual plot), by using statistics such as Cook's distance, or by examining leverage and other diagnostic plots.

Do not ignore outliers when you find them, and do not automatically delete them! Outliers are often very interesting points that you might want to learn more about, and they aren't necessarily mistakes in the data (although sometimes they are).

You may want to perform .vocab[sensitivity analyses] after removing outliers. Do your results or overall message change? How .vocab[robust] are your conclusions to the outliers?
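
---

### What to do with outliers?

As a rough sketch of such a sensitivity analysis (not part of the original lecture code, and again assuming the toy `y` and `x` from earlier), you could refit the model without the flagged points and compare the estimates:

```r
# Sketch: refit without observations flagged by Cook's distance and compare
# coefficients (assumes at least one observation exceeds the 0.5 threshold)
m_full  <- lm(y ~ x)
flagged <- which(cooks.distance(m_full) > 0.5)
m_sens  <- update(m_full, subset = -flagged)
coef(m_full)
coef(m_sens)  # do the conclusions change appreciably without those points?
```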

---

### Another potential issue

```
##      height     age logfev1
## 1539   1.52 14.1985 0.60432
## 495    1.68 15.9562 1.16002
## 353    1.42 11.1102 0.53649
## 1015   1.42 10.8008 0.62594
## 132    1.47 11.1923 0.87129
## 1419   1.63 14.6502 1.21194
## 803    1.30  7.3128 0.18232
## 867    1.66 18.0726 1.34025
## 674    1.63 14.3874 1.17248
## 129    1.24  7.1348 0.54232
## 643    1.22  8.2930 0.41211
## 571    1.35  9.9384 0.60432
## 98     1.27  7.4251 0.37156
## 126    1.63 17.0568 1.16002
## 1954   1.63 15.4552 1.19089
## 558    1.71 17.7413 1.25846
## 1672   1.43 11.0856 0.60977
## 151    1.57 14.5188 1.02245
## 109    1.49 11.5811 0.94391
## 1850   1.63 15.7043 1.30563
## 1908   1.60 16.1232 1.13462
## 1167   1.26  8.7310 0.49470
## 1916   1.68 17.2567 1.27257
## 994    1.51 15.6140 0.93609
## 1056   1.33  9.2156 0.29267
## 143    1.45 10.6694 0.68813
## 890    1.63 16.9665 1.01523
## 220    1.69 17.6756 1.17866
## 1566   1.62 14.3874 1.00430
## 1714   1.57 13.2813 0.70310
## 1120   1.55 12.6324 0.69813
```

---

### Another potential issue

```r
m1 <- lm(logfev1 ~ height, data = fev)
summary(m1)
```

```
## 
## Call:
## lm(formula = logfev1 ~ height, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27684 -0.08115  0.04251  0.09438  0.23525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -2.2213     0.2674  -8.307 3.71e-09
## height        2.0390     0.1763  11.564 2.21e-12
## 
## Residual standard error: 0.1449 on 29 degrees of freedom
## Multiple R-squared:  0.8218, Adjusted R-squared:  0.8157 
## F-statistic: 133.7 on 1 and 29 DF,  p-value: 2.213e-12
```

---

### Another potential issue

```r
m2 <- lm(logfev1 ~ age, data = fev)
summary(m2)
```

```
## 
## Call:
## lm(formula = logfev1 ~ age, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34938 -0.10021  0.01975  0.04908  0.22280 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.307985   0.105422  -2.921  0.00668
## age          0.088860   0.007791  11.405 3.08e-12
## 
## Residual standard error: 0.1466 on 29 degrees of freedom
## Multiple R-squared:  0.8177, Adjusted R-squared:  0.8114 
## F-statistic: 130.1 on 1 and 29 DF,  p-value: 3.083e-12
```

---

### Another potential issue

```r
m3 <- lm(logfev1 ~ height + age, data = fev)
summary(m3)
```

```
## 
## Call:
## lm(formula = logfev1 ~ height + age, data = fev)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.310946 -0.072796  0.001951  0.100668  0.239072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.36514    0.55433  -2.463   0.0202
## height       1.09719    0.56574   1.939   0.0626
## age          0.04315    0.02472   1.746   0.0918
## 
## Residual standard error: 0.1401 on 28 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8278 
## F-statistic: 73.11 on 2 and 28 DF,  p-value: 7.667e-12
```

---

### Another potential issue

<img src="assumptions_2_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

### Multicollinearity

.vocab[Multicollinearity] occurs when predictors in a regression model are very highly correlated with each other (if *perfect* multicollinearity exists, then we can't even fit the model!).

In this case, since age and height are so highly correlated, it is hard to know which one(s) are "responsible" for higher log-FEV1.

When multicollinearity occurs (that is, when the predictors are highly collinear), it becomes more difficult to precisely estimate the individual slope parameters. Because of this, we often get inflated standard error estimates, leading to higher p-values than we might expect, or overly wide confidence intervals for regression estimates.

You might suspect multicollinearity when the overall F-test of your model is statistically significant, but the individual tests of your predictor slopes are not.

.question[
What should we do (if anything)? Is simply dropping variables ok?
]
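
---

### Multicollinearity

A common way to quantify multicollinearity is with variance inflation factors (VIFs). The chunk below is a sketch rather than part of the original lecture code; it assumes the `fev` data from the previous slides and the `car` package loaded earlier.

```r
# Sketch: check the predictor correlation and the variance inflation factors
# (assumes the fev data frame from the previous slides is available)
cor(fev$height, fev$age)                           # correlation between the two predictors
car::vif(lm(logfev1 ~ height + age, data = fev))   # VIFs well above ~5 suggest trouble
```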