class: center, middle, inverse, title-slide # Model selection
👌 --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - Team meetings: Find a time when all team members can come meet with me for 20 minutes, ideally during my office hours, or suggest a different time on Slack. - Office hours: - Thursday (today): 2:30 - 4:30pm - Monday: 11:30 - 1:30pm (no Thursday OH next week) - Changed due date for project proposal - Check out the Data and Visualization Library services for help with finding data for your projects: [library.duke.edu/data/](https://library.duke.edu/data/) - You will be in new teams for your projects - New team HW due Tuesday --- class: center, middle # Quality of fit in MLR --- ## From last time... Load data: ```r pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA")) ``` -- Filter for paintings < 5000 square inches: ```r pp_Surf_lt_5000 <- pp %>% filter(Surface < 5000) ``` -- Model with main effects only: ```r m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000) ``` -- Model with interaction effects: ```r m_int <- lm(log(price) ~ Surface + factor(artistliving) + Surface * factor(artistliving), data = pp_Surf_lt_5000) ``` --- ## R-squared - `\(R^2\)` is the percentage of variability in the response variable explained by the regression model. ```r glance(m_main)$r.squared ``` ``` ## [1] 0.01320884 ``` ```r glance(m_int)$r.squared ``` ``` ## [1] 0.0176922 ``` -- - Clearly the model with interactions has a higher `\(R^2\)`. -- - However using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea as `\(R^2\)` increases when **any** variable is added to the model. --- ## R-squared - first principles $$ R^2 = \frac{ SS\_{Reg} }{ SS\_{Total} } = 1 - \left( \frac{ SS\_{Error} }{ SS\_{Total} } \right) $$ .question[ Calculate `\(R^2\)` based on the output below. ] ```r anova(m_main) ``` ``` ## Analysis of Variance Table ## ## Response: log(price) ## Df Sum Sq Mean Sq F value Pr(>F) ## Surface 1 138.5 138.537 40.6741 2.058e-10 ## factor(artistliving) 1 6.8 6.810 1.9994 0.1575 ## Residuals 3188 10858.4 3.406 ``` --- ## Adjusted R-squared $$ R^2\_{adj} = 1 - \left( \frac{ SS\_{Error} }{ SS\_{Total} } \times \frac{n - 1}{n - k - 1} \right), $$ where `\(n\)` is the number of cases and `\(k\)` is the number of predictors in the model -- - Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new informaton or is completely unrelated. -- - This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models. --- ## In pursuit of Occam's Razor - Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected. - Model selection follows this principle. - We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model. - In other words, we prefer the simplest best model, i.e. **parsimonious** model. --- ## Comparing models It appears that adding the interaction actually increased adjusted `\(R^2\)`, so we should indeed use the model with the interactions. ```r glance(m_main)$adj.r.squared ``` ``` ## [1] 0.01258977 ``` ```r glance(m_int)$adj.r.squared ``` ``` ## [1] 0.01676753 ``` --- class: center, middle # Model selection --- ## Backwards elimination - Start with **full** model (including all candidate explanatory variables and all candidate interactions) - Remove one variable at a time, and select the model with the highest adjusted `\(R^2\)` - Continue until adjusted `\(R^2\)` does not increase --- ## Forward selection - Start with **empty** model - Add one variable (or interaction effect) at a time, and select the model with the highest adjusted `\(R^2\)` - Continue until adjusted `\(R^2\)` does not increase --- ## Model selection and interaction effects Model selection for models including interaction effects must follow the following two principles: - If an interaction is included in the model, the main effects of both of those variables must also be in the model. - If a main effect is not in the model, then its interaction should not be in the model. --- ## Other model selection criteria - Adjusted `\(R^2\)` is one model selection criterion - There are others out there (many many others!), we'll discuss some later in the course, and some you might see in other courses --- ## Exploration 2 ### Price, surface area, material, shape .small[ ```r m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp) tidy(m_sms) %>% print(n = 25) ``` ``` ## # A tibble: 25 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.97 2.23 1.78 7.58e- 2 ## 2 Surface 0.000349 0.0000379 9.21 5.76e-20 ## 3 matal 0.860 2.23 0.385 7.00e- 1 ## 4 matar -1.27 2.23 -0.567 5.71e- 1 ## 5 matb 1.59 1.30 1.22 2.21e- 1 ## 6 matbr 0.982 1.48 0.664 5.07e- 1 ## 7 matc 1.81 1.30 1.39 1.66e- 1 ## 8 matca -0.538 1.83 -0.295 7.68e- 1 ## 9 matco -0.342 1.50 -0.228 8.20e- 1 ## 10 mate 0.996 2.23 0.446 6.56e- 1 ## 11 matg 0.603 1.67 0.361 7.18e- 1 ## 12 math -1.04 1.58 -0.657 5.11e- 1 ## 13 matm 3.24 2.26 1.43 1.52e- 1 ## 14 matmi 3.44 2.25 1.53 1.26e- 1 ## 15 mato 2.43 1.58 1.54 1.25e- 1 ## 16 matp 0.782 1.34 0.583 5.60e- 1 ## 17 matpa 2.95 1.50 1.97 4.87e- 2 ## 18 matt 0.885 1.30 0.681 4.96e- 1 ## 19 matta 0.521 1.31 0.396 6.92e- 1 ## 20 matv 0.531 1.41 0.376 7.07e- 1 ## 21 Shapeoval -0.798 1.85 -0.430 6.67e- 1 ## 22 Shapeovale -0.272 1.86 -0.146 8.84e- 1 ## 23 Shaperonde -0.170 1.99 -0.0856 9.32e- 1 ## 24 Shaperound -1.70 1.83 -0.929 3.53e- 1 ## 25 Shapesqu_rect -0.226 1.82 -0.124 9.01e- 1 ``` ] --- ## Shape and material Collapse levels of `Shape` and `mat`erial variables with `forcats::fct_collapse`: .small[ ```r pp <- pp %>% mutate( Shape = fct_collapse(Shape, oval = c("oval", "ovale"), round = c("round", "ronde"), squ_rect = "squ_rect", other = c("octogon", "octagon", "miniature")), mat = fct_collapse(mat, metal = c("a", "br", "c"), canvas = c("co", "t", "ta"), paper = c("p", "ca"), wood = "b", other = c("e", "g", "h", "mi", "o", "pa", "v", "al", "ar", "m")) ) ``` ] --- ## Review fixes .pull-left[ ```r pp %>% count(Shape) ``` ``` ## # A tibble: 5 x 2 ## Shape n ## <fct> <int> ## 1 other 12 ## 2 oval 52 ## 3 round 74 ## 4 squ_rect 3219 ## 5 <NA> 36 ``` ] .pull-right[ ```r pp %>% count(mat) ``` ``` ## # A tibble: 6 x 2 ## mat n ## <fct> <int> ## 1 metal 321 ## 2 other 59 ## 3 wood 886 ## 4 paper 38 ## 5 canvas 1783 ## 6 <NA> 306 ``` ] --- ## Refit model .question[ Interpret the slope for `matother`. *Hint:* What is the baseline level for the material variable? ] ```r m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp) tidy(m_sms) ``` ``` ## # A tibble: 9 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 5.75 1.83 3.15 1.66e- 3 ## 2 Surface 0.000344 0.0000377 9.14 1.17e-19 ## 3 matother -0.643 0.333 -1.93 5.38e- 2 ## 4 matwood -0.189 0.120 -1.58 1.14e- 1 ## 5 matpaper -1.10 0.343 -3.21 1.32e- 3 ## 6 matcanvas -0.916 0.115 -7.95 2.52e-15 ## 7 Shapeoval -0.519 1.84 -0.282 7.78e- 1 ## 8 Shaperound -1.62 1.84 -0.884 3.77e- 1 ## 9 Shapesqu_rect -0.223 1.82 -0.123 9.02e- 1 ``` --- ## Change the baseline level Change order of levels of the `mat` variable with `forcats::fct_relevel`: ```r pp <- pp %>% mutate(mat = fct_relevel(mat, "other")) levels(pp$mat) ``` ``` ## [1] "other" "metal" "wood" "paper" "canvas" ``` --- ## Refit model .question[ Interpret the slope for `matmetal`. *Hint:* What is the baseline level for the material variable? ] ```r m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp) tidy(m_sms) ``` ``` ## # A tibble: 9 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 5.10 1.85 2.76 5.84e- 3 ## 2 Surface 0.000344 0.0000377 9.14 1.17e-19 ## 3 matmetal 0.643 0.333 1.93 5.38e- 2 ## 4 matwood 0.454 0.324 1.40 1.61e- 1 ## 5 matpaper -0.460 0.457 -1.01 3.14e- 1 ## 6 matcanvas -0.273 0.322 -0.845 3.98e- 1 ## 7 Shapeoval -0.519 1.84 -0.282 7.78e- 1 ## 8 Shaperound -1.62 1.84 -0.884 3.77e- 1 ## 9 Shapesqu_rect -0.223 1.82 -0.123 9.02e- 1 ``` --- ## Application exercise: ### `ae-10-model-selection-pp` - First, plan the full model - Then, fit, evaluate, select - Lastly, interpret the final model