class: center, middle, inverse, title-slide

.title[
# Assumptions (kind of)
]
.author[
### Yue Jiang
]
.date[
### STA 210 / Duke University / Spring 2024
]

---

### Outliers and leverage

.vocab[Outliers] are points that don't follow the general pattern of the rest of the data

Points are said to have high .vocab[leverage] when they are extreme in some sense (e.g., unusual predictor values)

.vocab[Influential] points are those that disproportionately influence the results of a regression fit (e.g., the slope estimates)

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Outliers and leverage

<img src="assumptions_2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

.vocab[Cook's distance] is an estimate of how influential each observation is in a linear regression model. It measures how much all of the fitted values change when the `\(i^{th}\)` observation is removed: larger Cook's d implies larger influence. (A Cook's d greater than 0.5 or so is a common rule of thumb for flagging a potentially influential point.)

```r
library(car)
plot(cooks.distance(lm(y ~ x)),
     xlab = "Observation Index",
     ylab = "Cook's distance for regression model")
```

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Cook's distance

<img src="assumptions_2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
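
---

### Cook's distance

Beyond the plot, you can pull out the flagged observations directly. The chunk below is a sketch rather than part of the original lecture code; it assumes the same toy `y` and `x` used for the plots above.

```r
# Sketch: flag observations whose Cook's distance exceeds the ~0.5 rule of thumb
# (assumes the same toy y and x used on the previous slides)
m <- lm(y ~ x)
d <- cooks.distance(m)
which(d > 0.5)                    # indices of potentially influential observations
sort(d, decreasing = TRUE)[1:3]   # the three largest Cook's distances
```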

---

### Remember "augment"?

```r
library(tidymodels)
augment(lm(y ~ x)) # your model object goes here
```

```
## # A tibble: 20 × 8
##        y     x .fitted   .resid   .hat .sigma  .cooksd .std.resid
##    <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
##  1  14.8  3.86    13.8   0.986  0.0871  0.946 0.0562        1.09
##  2  18.5  5.36    17.7   0.780  0.0802  0.958 0.0319        0.855
##  3  15.1  4.23    14.8   0.322  0.0608  0.975 0.00396       0.350
##  4  19.1  5.65    18.4   0.655  0.109   0.964 0.0327        0.730
##  5  18.9  5.82    18.8   0.0645 0.131   0.978 0.000400      0.0728
##  6  13.2  3.14    12.0   1.23   0.187   0.921 0.236         1.44
##  7  16.3  4.58    15.7   0.576  0.0503  0.968 0.0102        0.622
##  8  17.1  5.68    18.5  -1.41   0.113   0.909 0.157        -1.57
##  9  16.7  4.65    15.9   0.810  0.0500  0.957 0.0201        0.874
## 10  14.6  4.37    15.1  -0.489  0.0548  0.971 0.00809      -0.529
## 11  18.5  5.87    19.0  -0.426  0.138   0.972 0.0187       -0.483
## 12  14.9  4.36    15.1  -0.238  0.0551  0.977 0.00194      -0.258
## 13  16.1  5.03    16.8  -0.751  0.0586  0.960 0.0206       -0.814
## 14  15.4  4.72    16.0  -0.592  0.0503  0.967 0.0108       -0.639
## 15  11.3  3.31    12.4  -1.11   0.157   0.934 0.150        -1.27
## 16  17.4  5.70    18.5  -1.12   0.115   0.935 0.102        -1.25
## 17  14.1  3.74    13.5   0.545  0.0997  0.968 0.0202        0.604
## 18  11.5  3.13    11.9  -0.408  0.189   0.972 0.0263       -0.476
## 19  12.8  3.98    14.1  -1.32   0.0766  0.920 0.0870       -1.45
## 20  20.8  5.86    19.0   1.89   0.137   0.844 0.365         2.14
```

---

### Another diagnostic plotting function

```r
library(ggfortify)
autoplot(lm(y ~ x)) # your model object goes here
```

---

### Another diagnostic plotting function

<img src="assumptions_2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

### What to do with outliers?

We can often detect outliers visually (e.g., in the residual plot), by using statistics such as Cook's distance, or by examining leverage and other diagnostic plots.

Do not ignore outliers when you find them, and do not automatically delete them! Outliers are often very interesting points that you might want to learn more about, and they aren't necessarily mistakes in the data (although sometimes they are).

You may want to perform .vocab[sensitivity analyses] after removing outliers. Do your results or overall message change? How .vocab[robust] are your conclusions to the outliers?
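
---

### What to do with outliers?

As a rough sketch of such a sensitivity analysis (not part of the original lecture code, and again assuming the toy `y` and `x` from earlier), you could refit the model without the flagged points and compare the estimates:

```r
# Sketch: refit without observations flagged by Cook's distance and compare
# coefficients (assumes at least one observation exceeds the 0.5 threshold)
m_full  <- lm(y ~ x)
flagged <- which(cooks.distance(m_full) > 0.5)
m_sens  <- update(m_full, subset = -flagged)
coef(m_full)
coef(m_sens)  # do the conclusions change appreciably without those points?
```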

---

### Another potential issue

```
##      height     age logfev1
## 1539   1.52 14.1985 0.60432
## 495    1.68 15.9562 1.16002
## 353    1.42 11.1102 0.53649
## 1015   1.42 10.8008 0.62594
## 132    1.47 11.1923 0.87129
## 1419   1.63 14.6502 1.21194
## 803    1.30  7.3128 0.18232
## 867    1.66 18.0726 1.34025
## 674    1.63 14.3874 1.17248
## 129    1.24  7.1348 0.54232
## 643    1.22  8.2930 0.41211
## 571    1.35  9.9384 0.60432
## 98     1.27  7.4251 0.37156
## 126    1.63 17.0568 1.16002
## 1954   1.63 15.4552 1.19089
## 558    1.71 17.7413 1.25846
## 1672   1.43 11.0856 0.60977
## 151    1.57 14.5188 1.02245
## 109    1.49 11.5811 0.94391
## 1850   1.63 15.7043 1.30563
## 1908   1.60 16.1232 1.13462
## 1167   1.26  8.7310 0.49470
## 1916   1.68 17.2567 1.27257
## 994    1.51 15.6140 0.93609
## 1056   1.33  9.2156 0.29267
## 143    1.45 10.6694 0.68813
## 890    1.63 16.9665 1.01523
## 220    1.69 17.6756 1.17866
## 1566   1.62 14.3874 1.00430
## 1714   1.57 13.2813 0.70310
## 1120   1.55 12.6324 0.69813
```

---

### Another potential issue

```r
m1 <- lm(logfev1 ~ height, data = fev)
summary(m1)
```

```
## 
## Call:
## lm(formula = logfev1 ~ height, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27684 -0.08115  0.04251  0.09438  0.23525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -2.2213     0.2674  -8.307 3.71e-09
## height        2.0390     0.1763  11.564 2.21e-12
## 
## Residual standard error: 0.1449 on 29 degrees of freedom
## Multiple R-squared:  0.8218, Adjusted R-squared:  0.8157 
## F-statistic: 133.7 on 1 and 29 DF,  p-value: 2.213e-12
```

---

### Another potential issue

```r
m2 <- lm(logfev1 ~ age, data = fev)
summary(m2)
```

```
## 
## Call:
## lm(formula = logfev1 ~ age, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34938 -0.10021  0.01975  0.04908  0.22280 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.307985   0.105422  -2.921  0.00668
## age          0.088860   0.007791  11.405 3.08e-12
## 
## Residual standard error: 0.1466 on 29 degrees of freedom
## Multiple R-squared:  0.8177, Adjusted R-squared:  0.8114 
## F-statistic: 130.1 on 1 and 29 DF,  p-value: 3.083e-12
```

---

### Another potential issue

```r
m3 <- lm(logfev1 ~ height + age, data = fev)
summary(m3)
```

```
## 
## Call:
## lm(formula = logfev1 ~ height + age, data = fev)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.310946 -0.072796  0.001951  0.100668  0.239072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.36514    0.55433  -2.463   0.0202
## height       1.09719    0.56574   1.939   0.0626
## age          0.04315    0.02472   1.746   0.0918
## 
## Residual standard error: 0.1401 on 28 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8278 
## F-statistic: 73.11 on 2 and 28 DF,  p-value: 7.667e-12
```

---

### Another potential issue

<img src="assumptions_2_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

### Multicollinearity

.vocab[Multicollinearity] occurs when predictors in a regression model are very highly correlated with each other (if *perfect* multicollinearity exists, then we can't even fit the model!).

In this case, since age and height are so highly correlated, it is hard to know which one(s) are "responsible" for higher log-FEV1.

When multicollinearity occurs (that is, when the predictors are highly collinear), it becomes more difficult to precisely estimate the individual slope parameters. Because of this, we often get inflated standard error estimates, leading to higher p-values than we might expect, or overly wide confidence intervals for regression estimates.

You might suspect multicollinearity when the overall F-test of your model is statistically significant, but the individual tests of your predictor slopes are not.

.question[
What should we do (if anything)? Is simply dropping variables ok?
]
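
---

### Multicollinearity

A common way to quantify multicollinearity is with variance inflation factors (VIFs). The chunk below is a sketch rather than part of the original lecture code; it assumes the `fev` data from the previous slides and the `car` package loaded earlier.

```r
# Sketch: check the predictor correlation and the variance inflation factors
# (assumes the fev data frame from the previous slides is available)
cor(fev$height, fev$age)                           # correlation between the two predictors
car::vif(lm(logfev1 ~ height + age, data = fev))   # VIFs well above ~5 suggest trouble
```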