class: center, middle, inverse, title-slide

# Model Diagnostics

### Dr. Maria Tackett

### 02.18.19

---

## Announcements

- HW 03 due today at 11:59p
- Lab 05 due Wed, Feb 20 at 11:59p
- Looking ahead:
    - **Exam 1 on Mon, Feb 25**
    - Practice exam on Sakai
    - Can bring 1 page of notes

---

## R packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(cowplot) # use plot_grid function
```

---

class: middle, center

## Influential and Leverage Points

---

## Influential Observations

An observation is <font class="vocab3">influential</font> if removing it substantially changes the coefficients of the regression model

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Influential Observations

- In addition to the coefficients, influential observations can have a large impact on the standard errors
- Occasionally these observations can be identified in a scatterplot
    + This is often not the case, especially when dealing with multivariate data
- We will use measures to quantify an individual observation's influence on the regression model
    + **leverage**, **standardized residuals**, and **Cook's distance**

---

## Leverage

- <font class="vocab3">Leverage: </font> a measure of the distance between an observation's explanatory variable values and the average explanatory variable values in the whole data set
- An observation has high leverage if its combination of values for the explanatory variables is very far from the typical combinations in the data
    + It is **<u>potentially</u>** an influential point, i.e. it may have a large impact on the coefficient estimates and standard errors
- **Note:** Identifying points with high leverage has nothing to do with the values of the response variable

---

## Calculating Leverage

- <font class="vocab">Simple regression:</font> the leverage of the `\(i^{th}\)` observation is

.alert[
`$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}$$`
]

- <font class="vocab">Multiple regression:</font> the leverage of the `\(i^{th}\)` observation is the `\(i^{th}\)` diagonal element of

`$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$`

---

## High Leverage

- The leverage of each observation is between `\(\frac{1}{n}\)` and 1
- The average leverage across all observations in the data set is `\(\frac{(p+1)}{n}\)`
- The threshold for **high leverage** is relatively arbitrary
    + one common threshold is `\(\frac{2(p+1)}{n}\)`
- Observations with high leverage tend to have small residuals

---

## High Leverage

- Questions to ask if you identify points with high leverage:
    + Are they the result of data entry errors?
    + Are they within the scope of the individuals for which you want to make predictions?
    + Are they impacting the estimates of the model coefficients, especially for interactions?
- A point with high leverage does not necessarily have a substantial impact on the regression, so you should check the other measures as well.
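---

### Aside: computing leverage in R

The two leverage formulas agree: the simple regression formula is the special case of the hat matrix diagonal. Below is a minimal sketch checking this on the built-in `mtcars` data (an illustration, not part of the tips example used later).

```r
# leverage from the simple regression formula
x <- mtcars$wt
h_simple <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# leverage from the diagonal of the hat matrix H = X(X'X)^{-1}X'
X <- model.matrix(mpg ~ wt, data = mtcars)
H <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(h_simple, unname(diag(H)))  # TRUE: the two formulas match
```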
---

### Standardized & Studentized Residuals

- What is the best way to identify outliers (points that don't fit the pattern from the regression line)?

--

- Look for points that have large residuals

--

- We want a common scale, so we can more easily identify "large" residuals

--

- We will look at each residual divided by its standard error

---

## Standardized Residuals

.alert[
`$$std.res_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}$$`
]

<br>

--

- The standard error of a residual, `\(\hat{\sigma}\sqrt{1-h_i}\)`, depends on the values of the predictor variables

--

- Residuals for high leverage observations have smaller variance than residuals for low leverage observations
    + This is because the regression line tries to fit high leverage observations as closely as possible

---

## Standardized Residuals

- Observations with very large standardized residuals are outliers, since they don't fit the pattern determined by the regression model
- Observations with standardized residuals of magnitude `\(> 2\)` should be examined more closely
- Observations with large standardized residuals are outliers but may not have an impact on the regression line
- <font class="vocab">Good practice: </font>Make residual plots with standardized residuals
    + It is easier to identify outliers and check the constant variance assumption

---

### Motivating Cook's Distance

- If an observation has a large impact on the estimated regression coefficients, then when we drop that observation...
    + The estimated coefficients should change
    + The predicted `\(Y\)` value for that observation should change
- One way to measure each observation's impact would be to delete it, rerun the regression, and compare the predicted `\(Y\)` values from the new and original models
    + This could be very time consuming
- Instead, we can use <font class="vocab3">Cook's Distance</font>, which measures the change in the predicted `\(Y\)` value when an observation is dropped

---

## Cook's Distance

- <font class="vocab3">Cook's Distance: </font> a measure of an observation's overall impact, i.e. the effect removing the observation has on the estimated coefficients
- For the `\(i^{th}\)` observation, Cook's Distance is

.alert[
`$$D_i = \frac{1}{p}(std.res_i)^2\bigg(\frac{h_i}{1-h_i}\bigg)$$`
]

- *Note:* Cook's distance, `\(D_i\)`, incorporates both the residual and the leverage of each observation
- An observation with large `\(D_i\)` is said to have a strong influence on the predicted values

---

## Using these measures

- Standardized residuals, leverage, and Cook's Distance should all be examined together
- Examine plots of these measures to identify observations that may have an impact on your regression model
- Some thresholds for flagging potentially influential observations:
    + **Leverage**: `\(h_i > \frac{2(p+1)}{n}\)` (some software uses `\(2p/n\)`)
    + **Standardized residuals**: `\(|std.res_i| > 2\)`
    + **Cook's Distance**: `\(D_i > 1\)`
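---

### Aside: checking the Cook's Distance formula

A quick sketch (again on the built-in `mtcars` data, not from the tips example) verifying that the formula above reproduces R's `cooks.distance()`; here `\(p\)` is taken to count all estimated coefficients, including the intercept.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
p <- length(coef(m))  # number of estimated coefficients (including intercept)

# D_i = (1/p) * std.res_i^2 * h_i / (1 - h_i)
d_manual <- (rstandard(m)^2 / p) * (hatvalues(m) / (1 - hatvalues(m)))
all.equal(d_manual, cooks.distance(m))  # TRUE
```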
---

### What to do with outliers/influential observations?

- It is **<font color="green">OK</font>** to drop an observation based on the **<u>predictor variables</u>** if...
    - It is meaningful to drop the observation given the context of the problem
    - You intended to build a model on a smaller range of the predictor variables. Mention this in the write-up of the results, and be careful to avoid extrapolation when making predictions

--

- It is **<font color="red">not OK</font>** to drop an observation based on the response variable
    - These are legitimate observations and should stay in the model
    - You can try transformations or increase the sample size by collecting more data

--

- In either case, you can try building the model with and without the outliers/influential observations

---

## Model diagnostics in R

- Use the <font class="vocab">`augment`</font> function in the broom package to output the model diagnostics (along with the predicted values and residuals)
- Output from `augment`:
    - response and predictor variables in the model
    - `.fitted`: predicted values
    - `.se.fit`: standard errors of the predicted values
    - `.resid`: residuals
    - **`.hat`**: leverage
    - `.sigma`: estimate of the residual standard deviation when the corresponding observation is dropped from the model
    - **`.cooksd`**: Cook's distance
    - **`.std.resid`**: standardized residuals

---

## Example: Restaurant tips

What affects the amount customers tip at a restaurant?

- **Response:**
    - <font class="vocab">`Tip`</font>: amount of the tip
- **Predictors:**
    - <font class="vocab">`Party`</font>: number of people in the party
    - <font class="vocab">`Meal`</font>: time of day (Lunch, Dinner, Late Night)
    - <font class="vocab">`Age`</font>: age category of the person paying the bill (Yadult, Middle, SenCit)

```r
tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party))
```

---

## Example: Tips

```r
model1 <- lm(Tip ~ Party + Meal + Age, data = tips)
kable(tidy(model1), format = "html", digits = 3)
```

<table>
<thead>
<tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.254 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 3.182 </td> <td style="text-align:right;"> 0.002 </td> </tr>
<tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1.808 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 14.909 </td> <td style="text-align:right;"> 0.000 </td> </tr>
<tr> <td style="text-align:left;"> MealLate Night </td> <td style="text-align:right;"> -1.632 </td> <td style="text-align:right;"> 0.407 </td> <td style="text-align:right;"> -4.013 </td> <td style="text-align:right;"> 0.000 </td> </tr>
<tr> <td style="text-align:left;"> MealLunch </td> <td style="text-align:right;"> -0.612 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> -1.523 </td> <td style="text-align:right;"> 0.130 </td> </tr>
<tr> <td style="text-align:left;"> AgeSenCit </td> <td style="text-align:right;"> 0.390 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 0.990 </td> <td style="text-align:right;"> 0.324 </td> </tr>
<tr> <td style="text-align:left;"> AgeYadult </td> <td style="text-align:right;"> -0.505 </td> <td style="text-align:right;"> 0.412 </td> <td style="text-align:right;"> -1.227 </td> <td style="text-align:right;"> 0.222 </td> </tr>
</tbody>
</table>

---

## Using the `augment` function

- Use the `augment` function to add the predicted values and model diagnostics to the data
- Add the observation number for use in diagnostic plots

```r
tips_output <- augment(model1) %>%
  mutate(obs_num = row_number())
```
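---

### Aside: `augment` vs. base R extractors

The diagnostic columns from `augment` should match what base R's extractor functions return; a quick sanity check (a sketch, not from the original slides):

```r
all.equal(tips_output$.hat, unname(hatvalues(model1)))          # leverage
all.equal(tips_output$.std.resid, unname(rstandard(model1)))    # standardized residuals
all.equal(tips_output$.cooksd, unname(cooks.distance(model1)))  # Cook's distance
```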
---

## Augmented data

```r
glimpse(tips_output)
```

```
## Observations: 169
## Variables: 12
## $ Tip        <dbl> 2.99, 2.00, 5.00, 4.00, 10.34, 4.85, 5.00, 4.00, 5.00…
## $ Party      <int> 1, 1, 1, 3, 2, 2, 4, 3, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2,…
## $ Meal       <chr> "Dinner", "Dinner", "Dinner", "Dinner", "Dinner", "Di…
## $ Age        <chr> "Yadult", "Yadult", "SenCit", "Middle", "SenCit", "Mi…
## $ .fitted    <dbl> 2.5562830, 2.5562830, 3.4515838, 6.6766419, 5.2591209…
## $ .se.fit    <dbl> 0.3771863, 0.3771863, 0.3939434, 0.2327069, 0.3604347…
## $ .resid     <dbl> 0.43371698, -0.55628302, 1.54841620, -2.67664190, 5.0…
## $ .hat       <dbl> 0.03722423, 0.03722423, 0.04060519, 0.01416878, 0.033…
## $ .sigma     <dbl> 1.960700, 1.960502, 1.957071, 1.949536, 1.918486, 1.9…
## $ .cooksd    <dbl> 3.294208e-04, 5.419132e-04, 4.612379e-03, 4.554814e-0…
## $ .std.resid <dbl> 0.226100143, -0.289994804, 0.808622859, -1.378941928,…
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
```

---

## Leverage

```r
leverage_threshold <- 2 * (5 + 1) / nrow(tips)
```

```r
ggplot(data = tips_output, aes(x = obs_num, y = .hat)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = leverage_threshold, color = "red") +
  labs(x = "Observation Number", y = "Leverage", title = "Leverage") +
  geom_text(aes(label = ifelse(.hat > leverage_threshold,
                               as.character(obs_num), "")),
            nudge_x = 4)
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Points with high leverage

```r
tips_output %>%
  filter(.hat > leverage_threshold) %>%
  select(Party, Meal, Age)
```

```
## # A tibble: 5 x 3
##   Party Meal       Age   
##   <int> <chr>      <chr> 
## 1     8 Late Night Middle
## 2     7 Dinner     SenCit
## 3     6 Dinner     SenCit
## 4     6 Late Night Middle
## 5     9 Lunch      SenCit
```

.question[
Why do you think these points have high leverage?
]

---

### Standardized Residuals vs. Predicted

```r
ggplot(data = tips_output, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = -2, color = "red", linetype = "dotted") +
  geom_hline(yintercept = 2, color = "red", linetype = "dotted") +
  labs(x = "Predicted Value", y = "Standardized Residuals",
       title = "Standardized Residuals vs. Predicted") +
  geom_text(aes(label = ifelse(abs(.std.resid) > 2,
                               as.character(obs_num), "")),
            nudge_x = 0.3)
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### Standardized residuals vs. predictors

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

### Points with large magnitude std.res.

```r
tips_output %>%
  filter(abs(.std.resid) > 2) %>%
  select(Party, Meal, Age, Tip)
```

```
## # A tibble: 8 x 4
##   Party Meal       Age      Tip
##   <int> <chr>      <chr>  <dbl>
## 1     2 Dinner     SenCit  10.3
## 2     2 Dinner     Yadult  11  
## 3     8 Late Night Middle  19.5
## 4     5 Lunch      Middle  15  
## 5     4 Dinner     Middle  16  
## 6     6 Late Night Middle   5  
## 7     4 Lunch      Middle   4  
## 8     3 Lunch      Middle  10  
```

.question[
- Why do you think these points have standardized residuals with large magnitude?
- What other variables could you examine?
]
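---

### Flagged observations in one table

One way to examine the measures together is to pull every flagged observation into a single table (a sketch, not from the original slides); points flagged by more than one measure deserve the closest look.

```r
tips_output %>%
  filter(.hat > leverage_threshold | abs(.std.resid) > 2 | .cooksd > 1) %>%
  select(obs_num, Party, Meal, Age, Tip, .hat, .std.resid, .cooksd)
```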
---

## Why we want to find outliers

- Estimate of the regression standard deviation, `\(\hat{\sigma}\)`, using all observations

```r
glance(model1)$sigma
```

```
## [1] 1.954983
```

- Estimate of `\(\hat{\sigma}\)` without the points with large magnitude standardized residuals

```r
tips_output %>%
  filter(abs(.std.resid) <= 2) %>%
  summarise(sigma_est = sqrt(sum(.resid^2) / (n() - 5 - 1)))
```

```
## # A tibble: 1 x 1
##   sigma_est
##       <dbl>
## 1      1.56
```

- Recall that we use `\(\hat{\sigma}\)` to calculate the standard errors for all confidence intervals and p-values, so outliers can affect the conclusions drawn from the model

---

### Cook's Distance

```r
ggplot(data = tips_output, aes(x = obs_num, y = .cooksd)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 1, color = "red") +
  labs(x = "Observation Number", y = "Cook's Distance", title = "Cook's Distance") +
  geom_text(aes(label = ifelse(.cooksd > 1, as.character(obs_num), "")))
```

<img src="10-model-diagnostics_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

class: middle, center

## Multicollinearity

---

### Why multicollinearity is a problem

- We can't include two variables that have a perfect linear association with each other
- If we did, we could not pick a unique best-fit model

---

### Why multicollinearity is a problem

- Ex.: Suppose the true population regression equation is `\(\mu\{Y|X\} = 3 + 4X\)`

--

- Suppose we try estimating that regression model using the variables `\(X\)` and `\(Z = X/10\)`

`$$\begin{aligned}\hat{\mu}\{Y|X\} &= \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2\frac{X}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)X\end{aligned}$$`

--

- We can set `\(\hat{\beta}_1\)` and `\(\hat{\beta}_2\)` to any two numbers such that `\(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\)`
    + The data are then unable to choose the "best" combination of `\(\hat{\beta}_1\)` and `\(\hat{\beta}_2\)`

---

### Why multicollinearity is a problem

- When we have almost perfect collinearity (i.e. highly correlated explanatory variables), the standard errors of the regression coefficients inflate
- In other words, we lose precision in our estimates of the regression coefficients

---

## Detecting Multicollinearity

Multicollinearity may occur when...

- There are very high correlations `\((r > 0.9)\)` among two or more explanatory variables, especially for smaller sample sizes

--

- One (or more) of the explanatory variables is an almost perfect linear combination of the others

--

- Quadratic terms are included without first mean-centering the variables

--

- Interactions between two or more continuous variables are included

---

## Detecting Multicollinearity

- Look at a correlation matrix of the explanatory variables, including all dummy variables
    + Look out for values close to 1 or -1
- If you think one explanatory variable is an almost perfect linear combination of the other explanatory variables, you can regress that variable on the others and see if `\(R^2\)` is close to 1
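---

### Aside: correlation matrix in R

One way to carry out the correlation-matrix check on the tips model (a sketch; `model.matrix` conveniently returns the predictors with dummy coding already applied):

```r
X <- model.matrix(model1)[, -1]  # predictor columns, intercept dropped
round(cor(X), 2)                 # look for values close to 1 or -1
```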
---

### Detecting Multicollinearity (VIF)

- <font class="vocab">Variance Inflation Factor (VIF)</font>: a measure of multicollinearity

.alert[
`$$VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}$$`
]

where `\(R^2_{X_j|X_{-j}}\)` is the proportion of the variation in `\(X_j\)` that is explained by the linear combination of the other explanatory variables in the model.

- Typically `\(VIF > 10\)` indicates concerning multicollinearity
- Use the <font class="vocab">`vif()`</font> function in the `rms` package to calculate VIF

---

## Tips VIF

- Calculate VIF using the <font class="vocab">`vif`</font> function in the rms package

```r
library(rms)
tidy(vif(model1))
```

```
## # A tibble: 5 x 2
##   names              x
##   <chr>          <dbl>
## 1 Party           1.19
## 2 MealLate Night  1.25
## 3 MealLunch       1.09
## 4 AgeSenCit       1.10
## 5 AgeYadult       1.40
```

---

### Calculating VIF for `Party`

```r
party_model <- lm(Party ~ Meal + Age, data = tips)
r.sq <- glance(party_model)$r.squared
(vif <- 1 / (1 - r.sq))
```

```
## [1] 1.193821
```

---

### Calculating VIF for `MealLateNight`

```r
# create indicator variables for Meal
tips <- tips %>%
  mutate(late_night = if_else(Meal == "Late Night", 1, 0),
         lunch = if_else(Meal == "Lunch", 1, 0))
```

```r
late_night_model <- lm(late_night ~ lunch + Party + Age, data = tips)
r.sq <- glance(late_night_model)$r.squared
(vif <- 1 / (1 - r.sq))
```

```
## [1] 1.250908
```
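---

### Aside: how collinearity inflates standard errors

A small simulation (not from the original slides) showing the loss of precision: adding a nearly collinear predictor blows up the standard error on `x1`.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # almost a perfect copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

tidy(lm(y ~ x1))       # small std.error for x1
tidy(lm(y ~ x1 + x2))  # std.error for x1 inflates dramatically
```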