class: center, middle, inverse, title-slide

# Model Diagnostics

### Dr. Maria Tackett

### 02.18.19

---

## Announcements

- HW 03 due today at 11:59p
- Lab 05 due Wed, Feb 20 at 11:59p
- Looking ahead:
    - **Exam 1 on Mon, Feb 25**
    - Practice exam on Sakai
    - Can bring 1 page of notes

---

## R packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(cowplot) # use plot_grid function
```

---

class: middle, center

## Influential and Leverage Points

---

## Influential Observations

An observation is <font class="vocab3">influential</font> if removing it substantially changes the coefficients of the regression model

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Influential Observations

- In addition to the coefficients, influential observations can have a large impact on the standard errors
- Occasionally these observations can be identified in a scatterplot
    + This is often not the case, especially when dealing with multivariate data
- We will use measures to quantify an individual observation's influence on the regression model
    + **leverage**, **standardized residuals**, and **Cook's distance**

---

## Leverage

- <font class="vocab3">Leverage: </font> a measure of the distance between an observation's explanatory variable values and the average explanatory variable values in the whole data set
- An observation has high leverage if its combination of values for the explanatory variables is very far from the typical combinations in the data
    + It is **<u>potentially</u>** an influential point, i.e. it may have a large impact on the coefficient estimates and standard errors
- **Note:** Identifying points with high leverage has nothing to do with the values of the response variable

---

## Calculating Leverage

- <font class="vocab">Simple regression:</font> the leverage of the `\(i^{th}\)` observation is

.alert[
`$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}$$`
]

- <font class="vocab">Multiple regression:</font> the leverage of the `\(i^{th}\)` observation is the `\(i^{th}\)` diagonal element of

`$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$`

---

## High Leverage

- The leverage of each observation is between `\(\frac{1}{n}\)` and 1
- The average leverage across all observations in the data set is `\(\frac{(p+1)}{n}\)`
- The threshold for **high leverage** is relatively arbitrary
    + one common threshold is `\(\frac{2(p+1)}{n}\)`
- Observations with high leverage tend to have small residuals

---

## High Leverage

- Questions to ask if you identify points with high leverage:
    + Are they the result of data entry errors?
    + Are they within the scope of the individuals for which you want to make predictions?
    + Are they impacting the estimates of the model coefficients, especially for interactions?
- A point with high leverage does not necessarily have a substantial impact on the regression, so you should check the other measures as well.
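---

### Aside: computing leverage in R

The two leverage formulas agree: the simple regression formula is the special case of the hat matrix diagonal. Below is a minimal sketch checking this on the built-in `mtcars` data (an illustration, not part of the tips example used later).

```r
# leverage from the simple regression formula
x <- mtcars$wt
h_simple <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# leverage from the diagonal of the hat matrix H = X(X'X)^{-1}X'
X <- model.matrix(mpg ~ wt, data = mtcars)
H <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(h_simple, unname(diag(H)))  # TRUE: the two formulas match
```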
---

### Standardized & Studentized Residuals

- What is the best way to identify outliers (points that don't fit the pattern from the regression line)?

--

- Look for points that have large residuals

--

- We want a common scale, so we can more easily identify "large" residuals

--

- We will look at each residual divided by its standard error

---

## Standardized Residuals

.alert[
`$$std.res_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}$$`
]

<br>

--

- The standard error of a residual, `\(\hat{\sigma}\sqrt{1-h_i}\)`, depends on the values of the predictor variables

--

- Residuals for high leverage observations have smaller variance than residuals for low leverage observations
    + This is because the regression line tries to fit high leverage observations as closely as possible

---

## Standardized Residuals

- Observations with very large standardized residuals are outliers, since they don't fit the pattern determined by the regression model
- Observations with standardized residuals of magnitude `\(> 2\)` should be examined more closely
- Observations with large standardized residuals are outliers but may not have an impact on the regression line
- <font class="vocab">Good practice: </font>Make residual plots with standardized residuals
    + It is easier to identify outliers and check the constant variance assumption

---

### Motivating Cook's Distance

- If an observation has a large impact on the estimated regression coefficients, then when we drop that observation...
    + The estimated coefficients should change
    + The predicted `\(Y\)` value for that observation should change
- One way to measure each observation's impact would be to delete it, rerun the regression, and compare the predicted `\(Y\)` values from the new and original models
    + This could be very time consuming
- Instead, we can use <font class="vocab3">Cook's Distance</font>, which measures the change in the predicted `\(Y\)` value when an observation is dropped

---

## Cook's Distance

- <font class="vocab3">Cook's Distance: </font> a measure of an observation's overall impact, i.e. the effect removing the observation has on the estimated coefficients
- For the `\(i^{th}\)` observation, Cook's Distance is

.alert[
`$$D_i = \frac{1}{p}(std.res_i)^2\bigg(\frac{h_i}{1-h_i}\bigg)$$`
]

- *Note:* Cook's distance, `\(D_i\)`, incorporates both the residual and the leverage of each observation
- An observation with large `\(D_i\)` is said to have a strong influence on the predicted values

---

## Using these measures

- Standardized residuals, leverage, and Cook's Distance should all be examined together
- Examine plots of these measures to identify observations that may have an impact on your regression model
- Some thresholds for flagging potentially influential observations:
    + **Leverage**: `\(h_i > \frac{2(p+1)}{n}\)` (some software uses `\(2p/n\)`)
    + **Standardized residuals**: `\(|std.res_i| > 2\)`
    + **Cook's Distance**: `\(D_i > 1\)`
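---

### Aside: checking the Cook's Distance formula

A quick sketch (again on the built-in `mtcars` data, not from the tips example) verifying that the formula above reproduces R's `cooks.distance()`; here `\(p\)` is taken to count all estimated coefficients, including the intercept.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
p <- length(coef(m))  # number of estimated coefficients (including intercept)

# D_i = (1/p) * std.res_i^2 * h_i / (1 - h_i)
d_manual <- (rstandard(m)^2 / p) * (hatvalues(m) / (1 - hatvalues(m)))
all.equal(d_manual, cooks.distance(m))  # TRUE
```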
---

### What to do with outliers/influential observations?

- It is **<font color="green">OK</font>** to drop an observation based on the **<u>predictor variables</u>** if...
    - It is meaningful to drop the observation given the context of the problem
    - You intended to build a model on a smaller range of the predictor variables. Mention this in the write-up of the results, and be careful to avoid extrapolation when making predictions

--

- It is **<font color="red">not OK</font>** to drop an observation based on the response variable
    - These are legitimate observations and should stay in the model
    - You can try transformations or increase the sample size by collecting more data

--

- In either case, you can try building the model with and without the outliers/influential observations

---

## Model diagnostics in R

- Use the <font class="vocab">`augment`</font> function in the broom package to output the model diagnostics (along with the predicted values and residuals)
- Output from `augment`:
    - response and predictor variables in the model
    - `.fitted`: predicted values
    - `.se.fit`: standard errors of the predicted values
    - `.resid`: residuals
    - **`.hat`**: leverage
    - `.sigma`: estimate of the residual standard deviation when the corresponding observation is dropped from the model
    - **`.cooksd`**: Cook's distance
    - **`.std.resid`**: standardized residuals

---

## Example: Restaurant tips

What affects the amount customers tip at a restaurant?

- **Response:**
    - <font class="vocab">`Tip`</font>: amount of the tip
- **Predictors:**
    - <font class="vocab">`Party`</font>: number of people in the party
    - <font class="vocab">`Meal`</font>: time of day (Lunch, Dinner, Late Night)
    - <font class="vocab">`Age`</font>: age category of the person paying the bill (Yadult, Middle, SenCit)

```r
tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party))
```

---

## Example: Tips

```r
model1 <- lm(Tip ~ Party + Meal + Age, data = tips)
kable(tidy(model1), format = "html", digits = 3)
```

<table>
<thead>
<tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.254 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 3.182 </td> <td style="text-align:right;"> 0.002 </td> </tr>
<tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1.808 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 14.909 </td> <td style="text-align:right;"> 0.000 </td> </tr>
<tr> <td style="text-align:left;"> MealLate Night </td> <td style="text-align:right;"> -1.632 </td> <td style="text-align:right;"> 0.407 </td> <td style="text-align:right;"> -4.013 </td> <td style="text-align:right;"> 0.000 </td> </tr>
<tr> <td style="text-align:left;"> MealLunch </td> <td style="text-align:right;"> -0.612 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> -1.523 </td> <td style="text-align:right;"> 0.130 </td> </tr>
<tr> <td style="text-align:left;"> AgeSenCit </td> <td style="text-align:right;"> 0.390 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 0.990 </td> <td style="text-align:right;"> 0.324 </td> </tr>
<tr> <td style="text-align:left;"> AgeYadult </td> <td style="text-align:right;"> -0.505 </td> <td style="text-align:right;"> 0.412 </td> <td style="text-align:right;"> -1.227 </td> <td style="text-align:right;"> 0.222 </td> </tr>
</tbody>
</table>

---

## Using the `augment` function

- Use the `augment` function to add the predicted values and model diagnostics to the data
- Add the observation number for use in diagnostic plots

```r
tips_output <- augment(model1) %>%
  mutate(obs_num = row_number())
```
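---

### Aside: `augment` vs. base R extractors

The diagnostic columns from `augment` should match what base R's extractor functions return; a quick sanity check (a sketch, not from the original slides):

```r
all.equal(tips_output$.hat, unname(hatvalues(model1)))          # leverage
all.equal(tips_output$.std.resid, unname(rstandard(model1)))    # standardized residuals
all.equal(tips_output$.cooksd, unname(cooks.distance(model1)))  # Cook's distance
```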
---

## Augmented data

```r
glimpse(tips_output)
```

```
## Observations: 169
## Variables: 12
## $ Tip        <dbl> 2.99, 2.00, 5.00, 4.00, 10.34, 4.85, 5.00, 4.00, 5.00…
## $ Party      <int> 1, 1, 1, 3, 2, 2, 4, 3, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2,…
## $ Meal       <chr> "Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Di…
## $ Age        <chr> "Yadult", "Yadult", "SenCit", "Middle", "SenCit", "Mi…
## $ .fitted    <dbl> 2.5562830, 2.5562830, 3.4515838, 6.6766419, 5.2591209…
## $ .se.fit    <dbl> 0.3771863, 0.3771863, 0.3939434, 0.2327069, 0.3604347…
## $ .resid     <dbl> 0.43371698, -0.55628302, 1.54841620, -2.67664190, 5.0…
## $ .hat       <dbl> 0.03722423, 0.03722423, 0.04060519, 0.01416878, 0.033…
## $ .sigma     <dbl> 1.960700, 1.960502, 1.957071, 1.949536, 1.918486, 1.9…
## $ .cooksd    <dbl> 3.294208e-04, 5.419132e-04, 4.612379e-03, 4.554814e-0…
## $ .std.resid <dbl> 0.226100143, -0.289994804, 0.808622859, -1.378941928,…
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
```

---

## Leverage

```r
leverage_threshold <- 2 * (5 + 1) / nrow(tips)
```

```r
ggplot(data = tips_output, aes(x = obs_num, y = .hat)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = leverage_threshold, color = "red") +
  labs(x = "Observation Number", y = "Leverage", title = "Leverage") +
  geom_text(aes(label = ifelse(.hat > leverage_threshold,
                               as.character(obs_num), "")),
            nudge_x = 4)
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Points with high leverage

```r
tips_output %>%
  filter(.hat > leverage_threshold) %>%
  select(Party, Meal, Age)
```

```
## # A tibble: 5 x 3
##   Party Meal       Age   
##   <int> <chr>      <chr> 
## 1     8 Late Night Middle
## 2     7 Dinner     SenCit
## 3     6 Dinner     SenCit
## 4     6 Late Night Middle
## 5     9 Lunch      SenCit
```

.question[
Why do you think these points have high leverage?
]

---

### Standardized Residuals vs. Predicted

```r
ggplot(data = tips_output, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = -2, color = "red", linetype = "dotted") +
  geom_hline(yintercept = 2, color = "red", linetype = "dotted") +
  labs(x = "Predicted Value", y = "Standardized Residuals",
       title = "Standardized Residuals vs. Predicted") +
  geom_text(aes(label = ifelse(abs(.std.resid) > 2,
                               as.character(obs_num), "")),
            nudge_x = 0.3)
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### Standardized residuals vs. predictors

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

### Points with large magnitude std.res.

```r
tips_output %>%
  filter(abs(.std.resid) > 2) %>%
  select(Party, Meal, Age, Tip)
```

```
## # A tibble: 8 x 4
##   Party Meal       Age      Tip
##   <int> <chr>      <chr>  <dbl>
## 1     2 Dinner     SenCit  10.3
## 2     2 Dinner     Yadult  11  
## 3     8 Late Night Middle  19.5
## 4     5 Lunch      Middle  15  
## 5     4 Dinner     Middle  16  
## 6     6 Late Night Middle   5  
## 7     4 Lunch      Middle   4  
## 8     3 Lunch      Middle  10  
```

.question[
- Why do you think these points have standardized residuals with large magnitude?
- What other variables could you examine?
]
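---

### Flagged observations in one table

One way to examine the measures together is to pull every flagged observation into a single table (a sketch, not from the original slides); points flagged by more than one measure deserve the closest look.

```r
tips_output %>%
  filter(.hat > leverage_threshold | abs(.std.resid) > 2 | .cooksd > 1) %>%
  select(obs_num, Party, Meal, Age, Tip, .hat, .std.resid, .cooksd)
```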
---

## Why we want to find outliers

- Estimate of the regression standard deviation, `\(\hat{\sigma}\)`, using all observations

```r
glance(model1)$sigma
```

```
## [1] 1.954983
```

- Estimate of `\(\hat{\sigma}\)` without the points with large magnitude standardized residuals

```r
tips_output %>%
  filter(abs(.std.resid) <= 2) %>%
  summarise(sigma_est = sqrt(sum(.resid^2) / (n() - 5 - 1)))
```

```
## # A tibble: 1 x 1
##   sigma_est
##       <dbl>
## 1      1.56
```

- Recall that we use `\(\hat{\sigma}\)` to calculate the standard errors for all confidence intervals and p-values, so outliers can affect the conclusions drawn from the model

---

### Cook's Distance

```r
ggplot(data = tips_output, aes(x = obs_num, y = .cooksd)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 1, color = "red") +
  labs(x = "Observation Number", y = "Cook's Distance", title = "Cook's Distance") +
  geom_text(aes(label = ifelse(.cooksd > 1, as.character(obs_num), "")))
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

class: middle, center

## Multicollinearity

---

### Why multicollinearity is a problem

- We can't include two variables that have a perfect linear association with each other
- If we did, we could not pick a unique best-fit model

---

### Why multicollinearity is a problem

- Ex.: Suppose the true population regression equation is `\(\mu\{Y|X\} = 3 + 4X\)`

--

- Suppose we try estimating that regression model using the variables `\(X\)` and `\(Z = X/10\)`

`$$\begin{aligned}\hat{\mu}\{Y|X\} &= \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2\frac{X}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)X\end{aligned}$$`

--

- We can set `\(\hat{\beta}_1\)` and `\(\hat{\beta}_2\)` to any two numbers such that `\(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\)`
    + The data are then unable to choose the "best" combination of `\(\hat{\beta}_1\)` and `\(\hat{\beta}_2\)`

---

### Why multicollinearity is a problem

- When we have almost perfect collinearity (i.e. highly correlated explanatory variables), the standard errors of the regression coefficients inflate
- In other words, we lose precision in our estimates of the regression coefficients

---

## Detecting Multicollinearity

Multicollinearity may occur when...

- There are very high correlations `\((r > 0.9)\)` among two or more explanatory variables, especially for smaller sample sizes

--

- One (or more) of the explanatory variables is an almost perfect linear combination of the others

--

- Quadratic terms are included without first mean-centering the variables

--

- Interactions between two or more continuous variables are included

---

## Detecting Multicollinearity

- Look at a correlation matrix of the explanatory variables, including all dummy variables
    + Look out for values close to 1 or -1
- If you think one explanatory variable is an almost perfect linear combination of the other explanatory variables, you can regress that variable on the others and see if `\(R^2\)` is close to 1
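---

### Aside: correlation matrix in R

One way to carry out the correlation-matrix check on the tips model (a sketch; `model.matrix` conveniently returns the predictors with dummy coding already applied):

```r
X <- model.matrix(model1)[, -1]  # predictor columns, intercept dropped
round(cor(X), 2)                 # look for values close to 1 or -1
```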
---

### Detecting Multicollinearity (VIF)

- <font class="vocab">Variance Inflation Factor (VIF)</font>: a measure of multicollinearity

.alert[
`$$VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}$$`
]

where `\(R^2_{X_j|X_{-j}}\)` is the proportion of the variation in `\(X_j\)` that is explained by the linear combination of the other explanatory variables in the model.

- Typically `\(VIF > 10\)` indicates concerning multicollinearity
- Use the <font class="vocab">`vif()`</font> function in the `rms` package to calculate VIF

---

## Tips VIF

- Calculate VIF using the <font class="vocab">`vif`</font> function in the rms package

```r
library(rms)
tidy(vif(model1))
```

```
## # A tibble: 5 x 2
##   names              x
##   <chr>          <dbl>
## 1 Party           1.19
## 2 MealLate Night  1.25
## 3 MealLunch       1.09
## 4 AgeSenCit       1.10
## 5 AgeYadult       1.40
```

---

### Calculating VIF for `Party`

```r
party_model <- lm(Party ~ Meal + Age, data = tips)
r.sq <- glance(party_model)$r.squared
(vif <- 1 / (1 - r.sq))
```

```
## [1] 1.193821
```

---

### Calculating VIF for `MealLateNight`

```r
# create indicator variables for Meal
tips <- tips %>%
  mutate(late_night = if_else(Meal == "Late Night", 1, 0),
         lunch = if_else(Meal == "Lunch", 1, 0))
```

```r
late_night_model <- lm(late_night ~ lunch + Party + Age, data = tips)
r.sq <- glance(late_night_model)$r.squared
(vif <- 1 / (1 - r.sq))
```

```
## [1] 1.250908
```
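---

### Aside: how collinearity inflates standard errors

A small simulation (not from the original slides) showing the loss of precision: adding a nearly collinear predictor blows up the standard error on `x1`.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # almost a perfect copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

tidy(lm(y ~ x1))       # small std.error for x1
tidy(lm(y ~ x1 + x2))  # std.error for x1 inflates dramatically
```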