class: center, middle, inverse, title-slide

# Model Selection

### Dr. Maria Tackett

### 02.27.19

---

## Announcements

- Labs resume on Friday

---

### Which variables should be in the model?

- This is a very hard question that is the subject of a lot of statistical research

- There are many different opinions about how to answer this question

- This lecture will mostly focus on how to approach variable selection; we will introduce specific methods

- See 6.1.2 and 6.1.3 in *ISLR* for more information about specific methods

---

### Which variables should you include?

- It depends on the goal of your analysis

- Though a variable selection procedure will select one set of variables for the model, that set is usually one of several equally good sets

- It is best to start with a well-defined question to help guide the variable selection

---

## Prediction

- <font class="vocab">Goal: </font>Calculate the most precise prediction of the response variable

- Interpreting coefficients is **not** important

- Choose only the variables that are strong predictors of the response variable
    + Excluding irrelevant variables can help reduce the widths of the prediction intervals

---

### One variable's effect

- <font class="vocab">Goal: </font>Understand one variable's effect on the response after adjusting for other factors

- Only interpret the coefficient of the variable that is the focus of the study
    + Interpreting the coefficients of the other variables is **not** important

- Any variables not selected for the final model have still been adjusted for, since they had a chance to be in the model

---

### Explanation

- <font class="vocab">Goal: </font>Identify variables that are important in explaining variation in the response

- Interpret any variables of interest

- Include all variables you think are related to the response, even if they are not statistically significant
    + This improves the interpretation of the coefficients of interest

- Interpret the coefficients with caution, especially if there are problems with multicollinearity in the model

---

## Example: SAT Averages by State

- This data set contains the average SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 U.S. states. The data are based on test takers for the 1982 exam.

- <font class="vocab">Data: </font>`case1201` data set in the `Sleuth3` package

- Response variable:
    + <font class="vocab">`SAT`</font>: average total SAT score

---

### SAT Averages: Explanatory Variables

- <font class="vocab">`State`</font>: U.S. state

- <font class="vocab">`Takers`</font>: percentage of high school seniors who took the exam

- <font class="vocab">`Income`</font>: median income of families of test-takers ($ hundreds)

- <font class="vocab">`Years`</font>: average number of years test-takers had formal education in social sciences, natural sciences, and humanities

- <font class="vocab">`Public`</font>: percentage of test-takers who attended public high schools

- <font class="vocab">`Expend`</font>: total state expenditure on high schools ($ hundreds per student)

- <font class="vocab">`Rank`</font>: median percentile rank of test-takers within their high school classes
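---

### Aside: loading the data in R

A minimal sketch (not part of the original slides) of how you might load and preview `case1201`; it assumes the `Sleuth3` package is installed.

```r
# load and preview the SAT data described on the previous slides
library(Sleuth3)

dim(case1201)   # one row per state
str(case1201)   # check the response and explanatory variables listed earlier
```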
---

## Model Selection Practice

- Go to the link: [http://bit.ly/sta210-model-selection](http://bit.ly/sta210-model-selection)

<br>

- For each scenario, identify the primary modeling objective.

---

### Practice: What's the primary objective?

Suppose you are on a legislative watchdog committee, and you want to determine the impact of state expenditures on state SAT scores. You decide to build a regression model for this purpose.

- What is the primary modeling objective?
    - One variable's effect
    - Prediction
    - Explanation

- What strategy would you use to select variables for the model?

---

### Practice: What's the primary objective?

Suppose you are on a committee tasked with improving the average SAT scores for your state. You have already determined that the number of test takers is an important variable, so you decide to include it in the regression model. Now you want to know what other variables significantly impact the average SAT score after accounting for the number of test takers.

- What is the primary modeling objective?
    - One variable's effect
    - Prediction
    - Explanation

- What strategy would you use to select variables for the model?

---

## Model Selection Criteria

- <font class="vocab">Akaike's Information Criterion (AIC):</font>
`$$AIC = n\log(RSS) - n \log(n) + 2(p+1)$$`

<br>

- <font class="vocab">Schwarz's Bayesian Information Criterion (BIC):</font>
`$$BIC = n\log(RSS) - n\log(n) + \log(n)\times(p+1)$$`

<br>
<br>

See the [supplemental note](https://github.com/STA210-Sp19/supplemental-notes/blob/master/models-selection-criteria.pdf) on AIC & BIC for derivations.

---

## AIC & BIC

`$$\begin{aligned} & AIC = \color{blue}{n\log(RSS)} - \color{green}{n \log(n)} + \color{red}{2(p+1)} \\ & BIC = \color{blue}{n\log(RSS)} - \color{green}{n\log(n)} + \color{red}{\log(n)\times(p+1)} \end{aligned}$$`

<br>
<br>

- <font color="blue">First term: </font>Decreases as *p* increases, since adding variables can only decrease the RSS

- <font color="green">Second term: </font>Fixed for a given sample size *n*

- <font color="red">Third term: </font>Increases as *p* increases

---

## Using AIC & BIC

`$$\begin{aligned} & AIC = n\log(RSS) - n \log(n) + \color{red}{2(p+1)} \\ & BIC = n\log(RSS) - n\log(n) + \color{red}{\log(n)\times(p+1)} \end{aligned}$$`

<br>
<br>

- Choose the model with the smallest AIC or BIC

- If `\(n \geq 8\)`, the <font color="red">penalty</font> for BIC is larger than that of AIC, so BIC tends to favor *more parsimonious* models (i.e. models with fewer terms)

---

## Backward Selection

- Start with the model that includes all variables of interest

- Drop variables one at a time that are deemed irrelevant based on some criterion. Common criteria include
    + Drop the variable with the highest p-value over some threshold (e.g. 0.05, 0.1)
    + Drop the variable that leads to the smallest value of AIC or BIC

- Stop when no more variables can be removed from the model based on the criterion (a sketch of one AIC-based backward step follows the stepwise selection slide)

---

## Forward Selection

- Start with the intercept-only model

- Add variables one at a time based on some criterion. Common criteria include
    + Add the variable with the smallest p-value under some threshold (e.g. 0.05, 0.1)
    + Add the variable that leads to the smallest value of AIC or BIC

- Stop when no more variables can be added to the model based on the criterion

---

## Stepwise Selection (Hybrid)

- Start with the intercept-only model

- Conduct one forward step to potentially add a variable to the model based on some criterion

- Conduct one backward step to potentially remove a variable from the model based on some criterion

- Stop when no other variables can be added to or removed from the model
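---

### One backward step, by hand

A minimal sketch (not part of the original slides) of a single AIC-based backward-elimination step, assuming a model `full_model` fit with `lm()` as on the following slides. Note that R's `AIC()` includes constants that the formula on the earlier slide omits, but the model rankings are the same.

```r
# variables currently in the model
candidates <- attr(terms(full_model), "term.labels")

# AIC of the model obtained by dropping each variable in turn
aic_without <- sapply(candidates, function(v) {
  AIC(update(full_model, as.formula(paste(". ~ . -", v))))
})

sort(aic_without)  # remove the variable whose removal gives the smallest AIC,
                   # then repeat until no removal lowers the AIC
```

`drop1(full_model)` performs the same comparison in a single call.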
---

### Caution!

- Different automated selection methods may choose different models

- You may miss key transformations or interaction effects that are not selected by the automated procedure

- You may find models that have no scientific use if automation, rather than science, is used to select the model

- Standard errors for the coefficients are difficult to interpret, since there is additional variability from the model selection procedure that should also be accounted for

---

## Model Selection in R

- Use the **`step`** function for forward, backward, and stepwise selection using AIC as the selection criterion

- Use the **`regsubsets`** function in the **leaps** package for forward, backward, and stepwise selection using BIC or Adj. `\(R^2\)` as the selection criterion

---

### `step` function (AIC)

```r
null_model <- lm(Y ~ 1, data = my_data)
full_model <- lm(Y ~ ., data = my_data)
```

- Forward selection

```r
regfit_forward <- step(null_model, scope = formula(full_model),
                       direction = "forward")
```

- Backward selection

```r
regfit_backward <- step(full_model, direction = "backward")
```

- Stepwise (hybrid) selection

```r
regfit_hybrid <- step(null_model, scope = formula(full_model),
                      direction = "both")
```

---

### `regsubsets` function (BIC, Adj. `\(R^2\)`)

- Forward selection

```r
library(leaps)
regfit_forward <- regsubsets(Y ~ ., data = my_data, method = "forward")
```

- Backward selection

```r
regfit_backward <- regsubsets(Y ~ ., data = my_data, method = "backward")
```

- Choose the best model
    + Code shown for forward selection; use similar code for backward selection

```r
sel_summary <- summary(regfit_forward)
coef(regfit_forward, which.max(sel_summary$adjr2))  # Adj R-sq
coef(regfit_forward, which.min(sel_summary$bic))    # BIC
```
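---

### Applying this to the SAT data

A minimal sketch (not part of the original slides) putting the pieces together for `case1201`. It assumes `State` is treated as an identifier rather than a candidate predictor; with 50 states and 50 observations it cannot usefully enter the model.

```r
library(Sleuth3)
library(leaps)

# keep the response and the six candidate predictors from the earlier slide
sat <- case1201[, c("SAT", "Takers", "Income", "Years", "Public", "Expend", "Rank")]

null_model <- lm(SAT ~ 1, data = sat)
full_model <- lm(SAT ~ ., data = sat)

# backward selection with AIC
sat_backward <- step(full_model, direction = "backward")

# forward selection, then pick the model with the smallest BIC
sat_forward <- regsubsets(SAT ~ ., data = sat, method = "forward")
sel_summary <- summary(sat_forward)
coef(sat_forward, which.min(sel_summary$bic))
```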