class: center, middle, inverse, title-slide # Variable selection ### Yue Jiang ### STA 210 / Duke University / Spring 2023 --- ### Variable selection It is what it sounds like - choosing which set of variables to include in a given model. Note the use of "given model" - variable selection is *not* the same as model selection. Last time, we talked about **cross-validation**, which was a set of techniques that we used to evaluate prediction strength of proposed models by looking at how well our model might perform on "new" data. These are often used as model selection tools (again, distinct from *variable* selection). Today we will focus on some additional techniques that you might consider when faced with an open-ended modeling task. --- ### All subset selection **All subset selection** is exactly as it sounds - take all possible combinations of variables (perhaps of a given model size) and see which combination of variables leads to the "best" result. .question[ How might we measure "best"? ] --- ### R^2 and adjusted R^2 `$$R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)$$` where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total variance in the response variable. Remember that `\(R^2\)` never decreases when **any** variable is added to the model, even if it's completely unrelated. `\(R^2_{adj}\)` tries to address this by creating a penalty term related to how many parameters are being estimated by the model: `$$R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right),$$` where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model. 
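---

### R^2 and adjusted R^2, by hand

As a quick sketch (not from the original slides - it uses R's built-in `mtcars` data so it runs on its own), both quantities can be computed directly from the formulas and checked against `summary()`:

```r
# Illustrative sketch using built-in data (mtcars), not the beijing data
m <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictors

ss_error <- sum(resid(m)^2)                         # sum of squared residuals
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in response

r2     <- 1 - ss_error / ss_total
r2_adj <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

c(r2, summary(m)$r.squared)          # should agree
c(r2_adj, summary(m)$adj.r.squared)  # should agree
```

Note that `\(R^2_{adj}\)` is always at most `\(R^2\)`, with the gap growing as `\(k\)` grows relative to `\(n\)`.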
--- ### Mallows' Cp `$$C_p = \frac{1}{n}\left(SS_{Error} + 2k\hat{\sigma}^2\right)$$` where `\(\hat{\sigma}^2\)` is an estimate of the error variance, taken from a model that contains **all** predictors (and `\(k\)` is once again the number of predictors in the model). Mallows' `\(C_p\)` is another statistic that might be used to choose between different subsets of predictors to include in a model. Like `\(R^2_{adj}\)`, there is a "penalty" of sorts for including more predictors in the model. .question[ Do we want a higher or lower `\(C_p\)`? ] --- ### AIC and BIC `\begin{align*} AIC &= n \log(SS_{Error}/n) + 2k\\ BIC &= n \log(SS_{Error}/n) + k \log(n) \end{align*}` Again, we see different levels of "penalties" applied based on the number of parameters being estimated. They also have various mathematical properties and are "optimal" in certain ways, but this is beyond the scope of STA 210. We can use these methods to **compare non-nested models**, but there is no associated formal hypothesis test. .question[ What are these penalties (literally) as they relate to the number of parameters? Which do you think more "severely" penalizes having many predictors? 
] --- ### Back to all subset selection ``` ## PM2.5 TEMP PRES DEWP RAIN WSPM ## 1 4 -0.7 1023.0 -18.8 0 4.4 ## 2 8 -1.1 1023.2 -18.2 0 4.7 ## 3 7 -1.1 1023.5 -18.2 0 5.6 ## 4 6 -1.4 1024.5 -19.4 0 3.1 ## 5 3 -2.0 1025.2 -19.5 0 2.0 ## 6 5 -2.2 1025.6 -19.6 0 3.7 ``` --- ### Back to all subset selection ```r library(leaps) m_all <- regsubsets(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing, nbest = 1, nvmax = 5) m_all ``` ``` ## Subset selection object ## Call: regsubsets.formula(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, ## data = beijing, nbest = 1, nvmax = 5) ## 5 Variables (and intercept) ## Forced in Forced out ## TEMP FALSE FALSE ## PRES FALSE FALSE ## DEWP FALSE FALSE ## RAIN FALSE FALSE ## WSPM FALSE FALSE ## 1 subsets of each size up to 5 ## Selection Algorithm: exhaustive ``` --- ### Back to all subset selection ```r summary(m_all) ``` ``` ## Subset selection object ## Call: regsubsets.formula(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, ## data = beijing, nbest = 1, nvmax = 5) ## 5 Variables (and intercept) ## Forced in Forced out ## TEMP FALSE FALSE ## PRES FALSE FALSE ## DEWP FALSE FALSE ## RAIN FALSE FALSE ## WSPM FALSE FALSE ## 1 subsets of each size up to 5 ## Selection Algorithm: exhaustive ## TEMP PRES DEWP RAIN WSPM ## 1 ( 1 ) " " " " " " " " "*" ## 2 ( 1 ) "*" " " "*" " " " " ## 3 ( 1 ) "*" "*" "*" " " " " ## 4 ( 1 ) "*" "*" "*" "*" " " ## 5 ( 1 ) "*" "*" "*" "*" "*" ``` --- ### Back to all subset selection ```r summary(m_all)$which ``` ``` ## (Intercept) TEMP PRES DEWP RAIN WSPM ## 1 TRUE FALSE FALSE FALSE FALSE TRUE ## 2 TRUE TRUE FALSE TRUE FALSE FALSE ## 3 TRUE TRUE TRUE TRUE FALSE FALSE ## 4 TRUE TRUE TRUE TRUE TRUE FALSE ## 5 TRUE TRUE TRUE TRUE TRUE TRUE ``` --- ### Back to all subset selection ```r summary(m_all)$rsq ``` ``` ## [1] 0.07768534 0.17531794 0.18249650 0.18476412 0.18655102 ``` --- ### Back to all subset selection ```r summary(m_all)$adjr2 ``` ``` ## [1] 0.07765831 0.17526959 0.18242461 0.18466853 0.18643179 ``` --- ### 
Back to all subset selection ```r summary(m_all)$cp ``` ``` ## [1] 4563.41840 471.07339 172.03160 78.93604 6.00000 ``` --- ### Back to all subset selection ```r summary(m_all)$bic ``` ``` ## [1] -2738.289 -6545.375 -6833.231 -6917.566 -6981.995 ``` A quick note: in many software packages, AIC and BIC are reported as -2 times the log-likelihood plus some penalty. Since we want to maximize the log-likelihood (and thus *minimize* -2 times it), we want the *smallest* value in the above output. Here, that again corresponds to the model with all five predictors. --- ### Forward selection and backward elimination These are **stepwise** methods that add or remove variables from a model sequentially based on some criterion. **Forward** selection: 1. Start with an intercept-only model 2. Consider all models which additionally have one more predictor 3. Choose the model that is the "best" according to some criterion (we've now seen several) 4. If the model improves, then choose that one, and cycle back to step 2. Otherwise, if no improvement is seen, stop at the last model chosen. --- ### Forward selection and backward elimination **Backward** elimination: 1. Start with a "full" model 2. Consider all models which have one predictor taken away 3. Choose the model that is the "best" according to some criterion (we've now seen several) 4. If the model improves, then choose that one, and cycle back to step 2. Otherwise, if no improvement is seen, stop at the last model chosen. 
(we can also do both directions at once, adding or subtracting variables as needed sequentially) --- ### Stepwise methods ```r m_none <- lm(PM2.5 ~ 1, data = beijing) m_all <- lm(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) ``` --- ### Stepwise methods ```r library(MASS) stepAIC(m_none, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "forward") ``` ``` ## Start: AIC=300832.2 ## PM2.5 ~ 1 ## ## Df Sum of Sq RSS AIC ## + WSPM 1 17887405 212367166 298075 ## + TEMP 1 3856377 226398194 300258 ## + DEWP 1 3321831 226932740 300338 ## + RAIN 1 45155 230209416 300827 ## <none> 230254571 300832 ## + PRES 1 7552 230247019 300833 ## ## Step: AIC=298075 ## PM2.5 ~ WSPM ## ## Df Sum of Sq RSS AIC ## + TEMP 1 3271904 209095262 297547 ## + DEWP 1 305286 212061880 298028 ## + PRES 1 49916 212317249 298069 ## + RAIN 1 13434 212353731 298075 ## <none> 212367166 298075 ## ## Step: AIC=297547.2 ## PM2.5 ~ WSPM + TEMP ## ## Df Sum of Sq RSS AIC ## + DEWP 1 19724119 189371143 294169 ## + PRES 1 5308810 203786452 296672 ## <none> 209095262 297547 ## + RAIN 1 2695 209092567 297549 ## ## Step: AIC=294168.7 ## PM2.5 ~ WSPM + TEMP + DEWP ## ## Df Sum of Sq RSS AIC ## + PRES 1 1635802 187735340 293875 ## + RAIN 1 386391 188984752 294101 ## <none> 189371143 294169 ## ## Step: AIC=293874.7 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES ## ## Df Sum of Sq RSS AIC ## + RAIN 1 434995 187300346 293798 ## <none> 187735340 293875 ## ## Step: AIC=293797.5 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = beijing) ## ## Coefficients: ## (Intercept) WSPM TEMP DEWP PRES RAIN ## 1426.074 -3.611 -5.449 3.641 -1.259 -3.918 ``` --- ### Stepwise methods ```r stepAIC(m_all, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "backward") ``` ``` ## Start: AIC=293797.5 ## PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM ## ## Df Sum of Sq RSS AIC ## <none> 187300346 293798 ## - WSPM 1 411443 
187711788 293870 ## - RAIN 1 434995 187735340 293875 ## - PRES 1 1684407 188984752 294101 ## - DEWP 1 16450903 203751248 296668 ## - TEMP 1 23549408 210849754 297836 ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) ## ## Coefficients: ## (Intercept) TEMP PRES DEWP RAIN WSPM ## 1426.074 -5.449 -1.259 3.641 -3.918 -3.611 ``` --- ### Stepwise methods ```r stepAIC(m_none, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "both") ``` ``` ## Start: AIC=300832.2 ## PM2.5 ~ 1 ## ## Df Sum of Sq RSS AIC ## + WSPM 1 17887405 212367166 298075 ## + TEMP 1 3856377 226398194 300258 ## + DEWP 1 3321831 226932740 300338 ## + RAIN 1 45155 230209416 300827 ## <none> 230254571 300832 ## + PRES 1 7552 230247019 300833 ## ## Step: AIC=298075 ## PM2.5 ~ WSPM ## ## Df Sum of Sq RSS AIC ## + TEMP 1 3271904 209095262 297547 ## + DEWP 1 305286 212061880 298028 ## + PRES 1 49916 212317249 298069 ## + RAIN 1 13434 212353731 298075 ## <none> 212367166 298075 ## - WSPM 1 17887405 230254571 300832 ## ## Step: AIC=297547.2 ## PM2.5 ~ WSPM + TEMP ## ## Df Sum of Sq RSS AIC ## + DEWP 1 19724119 189371143 294169 ## + PRES 1 5308810 203786452 296672 ## <none> 209095262 297547 ## + RAIN 1 2695 209092567 297549 ## - TEMP 1 3271904 212367166 298075 ## - WSPM 1 17302932 226398194 300258 ## ## Step: AIC=294168.7 ## PM2.5 ~ WSPM + TEMP + DEWP ## ## Df Sum of Sq RSS AIC ## + PRES 1 1635802 187735340 293875 ## + RAIN 1 386391 188984752 294101 ## <none> 189371143 294169 ## - WSPM 1 515672 189886814 294259 ## - DEWP 1 19724119 209095262 297547 ## - TEMP 1 22690737 212061880 298028 ## ## Step: AIC=293874.7 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES ## ## Df Sum of Sq RSS AIC ## + RAIN 1 434995 187300346 293798 ## <none> 187735340 293875 ## - WSPM 1 498578 188233919 293963 ## - PRES 1 1635802 189371143 294169 ## - DEWP 1 16051112 203786452 296672 ## - TEMP 1 23149487 210884828 297840 ## ## Step: AIC=293797.5 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN 
## ## Df Sum of Sq RSS AIC ## <none> 187300346 293798 ## - WSPM 1 411443 187711788 293870 ## - RAIN 1 434995 187735340 293875 ## - PRES 1 1684407 188984752 294101 ## - DEWP 1 16450903 203751248 296668 ## - TEMP 1 23549408 210849754 297836 ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = beijing) ## ## Coefficients: ## (Intercept) WSPM TEMP DEWP PRES RAIN ## 1426.074 -3.611 -5.449 3.641 -1.259 -3.918 ``` --- ### Stepwise methods Potential issues with stepwise methods: - Algorithms are greedy - no guarantee the "best" model will be found - Often doesn't work well with highly correlated variables - Order often matters and results in different final models chosen - "Only uses the data" without thought for existing scientific knowledge --- ### LASSO **LASSO**, or least absolute shrinkage and **selection** operator, is a regression technique that also happens to perform variable selection. Instead of minimizing the residual sum of squares like in OLS: `$$\min_\boldsymbol\beta \sum_{i = 1}^n (y_i - \mathbf{x}^T_i\boldsymbol\beta)^2$$` LASSO minimizes the following: `$$\min_\boldsymbol\beta \left(\sum_{i = 1}^n (y_i - \mathbf{x}^T_i\boldsymbol\beta)^2 + \lambda \sum_{k = 1}^p |\beta_k| \right)$$` Because of this additional "penalty," some of the `\(\beta_k\)` terms can be set to zero, thus performing variable selection. Often `\(\lambda\)` is chosen to optimize some criterion (e.g., predictive performance, perhaps using cross-validation). --- ### LASSO ```r y <- beijing$PM2.5 x <- model.matrix(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) library(glmnet) # (has LASSO-related functions here) m_lasso_cv <- cv.glmnet(x, y, alpha = 1) ``` In the above code, we're performing k-fold cross-validation to choose the "best" `\(\lambda\)` value (in this case, by minimizing MSE in the test sets from the cross-validation). The argument `alpha = 1` corresponds to LASSO. 
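---

### LASSO: choosing lambda

One detail worth knowing (a sketch, using simulated data rather than the `beijing` data, so it runs on its own): `cv.glmnet` also records `lambda.1se`, the largest `\(\lambda\)` whose cross-validated error is within one standard error of the minimum - a common, more conservative choice than `lambda.min`:

```r
library(glmnet)

# Simulate data where only two of five predictors actually matter
set.seed(210)
x_sim <- matrix(rnorm(500 * 5), ncol = 5)
y_sim <- drop(x_sim %*% c(2, 0, -1, 0, 0)) + rnorm(500)

cv_fit <- cv.glmnet(x_sim, y_sim, alpha = 1)
cv_fit$lambda.min               # lambda minimizing cross-validated MSE
cv_fit$lambda.1se               # largest lambda within 1 SE of that minimum
coef(cv_fit, s = "lambda.1se")  # typically a sparser set of coefficients
```

Because `lambda.1se` applies a heavier penalty than `lambda.min`, it tends to zero out more coefficients.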
--- ### LASSO ```r best_lambda <- m_lasso_cv$lambda.min best_lambda ``` ``` ## [1] 0.04932974 ``` --- ### LASSO ```r plot(m_lasso_cv) ``` --- ### LASSO ```r m_best <- glmnet(x, y, alpha = 1, lambda = best_lambda) m_best$beta ``` ``` ## 6 x 1 sparse Matrix of class "dgCMatrix" ## s0 ## (Intercept) . ## TEMP -5.407856 ## PRES -1.241963 ## DEWP 3.617576 ## RAIN -3.840631 ## WSPM -3.679301 ``` If LASSO had **shrunk** any coefficient estimates to zero, they would show up as `.` in the sparse matrix above. In this case, LASSO chose to keep all variables, much like the other examples we've seen. Note also that no intercept estimate appears here - `glmnet` handles the intercept separately (it is stored in `m_best$a0`, not in `beta`). --- ### Categorical predictors The **group LASSO** variant keeps or drops a categorical variable "as a group" - an all-or-nothing sort of deal; standard LASSO as implemented in `glmnet` penalizes each dummy variable separately. The stepwise functions as implemented in R treat categorical predictors as a group. However, the best subset selection function shown today treats each of the dummy variables separately. --- ### A last word **Use your domain knowledge**. Don't blindly use these sorts of "automatic" variable selection techniques (unless you specifically are interested in prediction and *only* prediction). - Look at your data! - Don't forget about interpretations (especially if transformations are involved) - Write down what your model actually represents - **Use your domain knowledge** (I can't stress this enough) **Use your domain knowledge.**