class: center, middle, inverse, title-slide

# Model fit and model selection
## Intro to Data Science
### Shawn Santo
### 02-27-20

---

## Announcements

- Homework 3 assigned today (check website)

- Lab 5 due Friday, Feb 28 at 11:59pm

- I'll post a reading to be completed for next lecture.

---

class: center, middle, inverse

# Recall

---

## Multiple linear regression

- We want to investigate how a collection of predictors is associated with a response.

- Our objective is to find the best-fit regression surface, as opposed to the best-fit line when we have only one predictor.

- To assess linearity we can no longer use an `x` vs. `y` scatter plot. We'll need to use residual plots.

- Interpretation changes slightly. When interpreting a single coefficient, we hold all other predictors constant.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.

- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.

- This can be accomplished by adding an **interaction variable**: the product of two explanatory variables.

---

## Load data; create models

Reminder: we're trying to create a model to predict log price based on surface area and whether the artist is still alive.

```r
library(tidyverse)
library(broom)

paris_paintings <- read_csv("data/paris_paintings.csv",
                            na = c("n/a", "", "NA"))

pp_surf_5000 <- paris_paintings %>%
  filter(Surface < 5000)

m_main <- lm(log(price) ~ Surface + factor(artistliving),
             data = pp_surf_5000)

m_int <- lm(log(price) ~ Surface * factor(artistliving),
            data = pp_surf_5000)
```

---

class: center, middle, inverse

# Quality of fit in MLR

---

## `\(R^2\)`

- `\(R^2\)` is the percentage of variability in the response variable explained by the regression model.

```r
glance(m_main) %>% 
  pull(r.squared)
```

```
#> [1] 0.01320884
```

```r
glance(m_int) %>% 
  pull(r.squared)
```

```
#> [1] 0.0176922
```

--

- Clearly the model with interactions has a higher `\(R^2\)`.

--

- However, using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea, as `\(R^2\)` increases when **any** variable is added to the model.

---

## `\(R^2\)` - first principles

- We can express `\(R^2\)` using the following ratio of sums of squares:

$$R^2 = 1 - \left( \frac{SS\_{Error}}{SS\_{Total}} \right)$$

where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total variability in the response variable.

- But remember, adding ANY explanatory variable will always increase `\(R^2\)`.

---

## Adjusted `\(R^2\)`

$$R^2\_{adj} = 1 - \left( \frac{SS\_{Error}}{SS\_{Total}} \times \frac{n - 1}{n - k - 1} \right),$$

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.

--

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated; the `\((n - 1)/(n - k - 1)\)` factor penalizes each additional predictor. (Both formulas are checked by hand a few slides ahead.)

--

- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Occam's Razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

- Model selection follows this principle.

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

- In other words, we prefer the simplest best model, i.e. the <font class="vocab">parsimonious</font> model.
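---

## Checking the formulas by hand

Before comparing the two models, here is a minimal sketch that recomputes `\(R^2\)` and adjusted `\(R^2\)` for `m_main` directly from the sums of squares above. It assumes the `m_main` fit from earlier; the results should match what `glance()` reports.

```r
# response values actually used in the fit, i.e. log(price)
y <- model.response(model.frame(m_main))

ss_error <- sum(residuals(m_main)^2)  # SS_Error: sum of squared residuals
ss_total <- sum((y - mean(y))^2)      # SS_Total: total variability in response

n <- length(y)  # number of observations
k <- 2          # predictors: Surface and factor(artistliving)

1 - ss_error / ss_total                            # R^2
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)  # adjusted R^2
```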
---

## Comparing models

Adding the interaction increased adjusted `\(R^2\)`, so we should use the model with the interaction.

```r
glance(m_main) %>% 
  pull(adj.r.squared)
```

```
#> [1] 0.01258977
```

```r
glance(m_int) %>% 
  pull(adj.r.squared)
```

```
#> [1] 0.01676753
```

---

class: center, middle, inverse

# Model selection

---

## Backwards elimination

- Start with a **full** model (including all candidate explanatory variables and all candidate interactions).

- Remove one variable at a time, and select the model with the highest adjusted `\(R^2\)`.

- Continue until adjusted `\(R^2\)` does not increase.

---

## Forward selection

- Start with an **empty** model.

- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted `\(R^2\)`.

- Continue until adjusted `\(R^2\)` does not increase.

---

## Model selection and interaction effects

If an interaction is included in the model, the main effects of both of those variables must also be in the model.

If a main effect is not in the model, then its interaction should not be in the model.

---

## Other model selection criteria

- Adjusted `\(R^2\)` is only one model selection criterion.

- There are many others out there; AIC and BIC are commonly used ones.

---

## Your turn

We will be revisiting the candy rankings data from HW 01.

**Project:** Model selection for candy data

**Goal:** Come up with the model that "best" predicts win percentage of candies

---

## Planning

Decide on a subset of variables to consider for your analysis.

- Consider 7-10 total variables, including interactions.

- Consider at least two interactions in the model.

- Interactions should be between a categorical variable and a numeric variable.

- The more variables you consider, the longer model selection will take, so keep that in mind.

---

## Task

- Use backwards elimination to do model selection. <b>Make sure to show each step of the decision process.</b> You don't have to interpret the models at each stage.

- Provide interpretations for the slopes of your final model.

- Further instructions are given in the .Rmd file for the application exercise, available here: https://classroom.github.com/a/p7Ej9Qm8
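---

## Getting started (sketch)

As a starting point for the exercise, the sketch below shows one backwards-elimination step: fit a full model, drop one candidate term at a time, and compare adjusted `\(R^2\)`. It assumes the candy data are available as `candy_rankings` from the **fivethirtyeight** package, and the variables shown are only illustrative; swap in however you loaded the data for HW 01 and your own choice of variables and interactions.

```r
library(fivethirtyeight)  # assumed source of the candy_rankings data
library(broom)

# full model with an interaction between a categorical and a numeric variable
m_full <- lm(winpercent ~ chocolate * sugarpercent + fruity + pricepercent +
               hard + bar,
             data = candy_rankings)

# drop one candidate term at a time and refit
m_drop_bar  <- update(m_full, . ~ . - bar)
m_drop_hard <- update(m_full, . ~ . - hard)

# keep the model with the highest adjusted R^2; repeat until it stops increasing
glance(m_full)$adj.r.squared
glance(m_drop_bar)$adj.r.squared
glance(m_drop_hard)$adj.r.squared
```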