class: center, middle, inverse, title-slide # Variable selection ### Yue Jiang ### STA 210 / Duke University / Spring 2023 --- ### Variable selection It is what it sounds like - choosing which set of variables to include in a given model. Note the use of "given model" - variable selection is *not* the same as model selection. Last time, we talked about **cross-validation**, which was a set of techniques that we used to evaluate prediction strength of proposed models by looking at how well our model might perform on "new" data. These are often used as model selection tools (again, distinct from *variable* selection). Today we will focus on some additional techniques that you might consider when faced with an open-ended modeling task. --- ### All subset selection **All subset selection** is exactly as it sounds - take all possible combinations of variables (perhaps of a given model size) and see which combination of variables leads to the "best" result. .question[ How might we measure "best"? ] --- ### R^2 and adjusted R^2 `$$R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)$$` where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total variance in the response variable. Remember that `\(R^2\)` never decreases when **any** variable is added to the model, even if it's completely unrelated. `\(R^2_{adj}\)` tries to address this by creating a penalty term related to how many parameters are being estimated by the model: `$$R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right),$$` where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model. 
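---

### R^2 and adjusted R^2, by hand

As a quick sketch (not from the original slides - it uses R's built-in `mtcars` data so it runs on its own), both quantities can be computed directly from the formulas and checked against `summary()`:

```r
# Illustrative sketch using built-in data (mtcars), not the beijing data
m <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictors

ss_error <- sum(resid(m)^2)                         # sum of squared residuals
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in response

r2     <- 1 - ss_error / ss_total
r2_adj <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

c(r2, summary(m)$r.squared)          # should agree
c(r2_adj, summary(m)$adj.r.squared)  # should agree
```

Note that `\(R^2_{adj}\)` is always at most `\(R^2\)`, with the gap growing as `\(k\)` grows relative to `\(n\)`.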
--- ### Mallows' Cp `$$C_p = \frac{1}{n}\left(SS_{Error} + 2k\hat{\sigma}^2\right)$$` where `\(\hat{\sigma}^2\)` is an estimate of the error variance, taken from a model that contains **all** predictors (and `\(k\)` is once again the number of predictors in the model). Mallows' `\(C_p\)` is another statistic that might be used to choose between different subsets of predictors to include in a model. Like `\(R^2_{adj}\)`, there is a "penalty" of sorts for including more predictors in the model. .question[ Do we want a higher or lower `\(C_p\)`? ] --- ### AIC and BIC `\begin{align*} AIC &= n \log(SS_{Error}/n) + 2k\\ BIC &= n \log(SS_{Error}/n) + k \log(n) \end{align*}` Again, we see different levels of "penalties" applied based on the number of parameters being estimated. They also have various mathematical properties and are "optimal" in certain ways, but this is beyond the scope of STA 210. We can use these methods to **compare non-nested models**, but there is no associated formal hypothesis test. .question[ What are these penalties (literally) as they relate to the number of parameters? Which do you think more "severely" penalizes having many predictors? 
] --- ### Back to all subset selection ``` ## PM2.5 TEMP PRES DEWP RAIN WSPM ## 1 4 -0.7 1023.0 -18.8 0 4.4 ## 2 8 -1.1 1023.2 -18.2 0 4.7 ## 3 7 -1.1 1023.5 -18.2 0 5.6 ## 4 6 -1.4 1024.5 -19.4 0 3.1 ## 5 3 -2.0 1025.2 -19.5 0 2.0 ## 6 5 -2.2 1025.6 -19.6 0 3.7 ``` --- ### Back to all subset selection ```r library(leaps) m_all <- regsubsets(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing, nbest = 1, nvmax = 5) m_all ``` ``` ## Subset selection object ## Call: regsubsets.formula(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, ## data = beijing, nbest = 1, nvmax = 5) ## 5 Variables (and intercept) ## Forced in Forced out ## TEMP FALSE FALSE ## PRES FALSE FALSE ## DEWP FALSE FALSE ## RAIN FALSE FALSE ## WSPM FALSE FALSE ## 1 subsets of each size up to 5 ## Selection Algorithm: exhaustive ``` --- ### Back to all subset selection ```r summary(m_all) ``` ``` ## Subset selection object ## Call: regsubsets.formula(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, ## data = beijing, nbest = 1, nvmax = 5) ## 5 Variables (and intercept) ## Forced in Forced out ## TEMP FALSE FALSE ## PRES FALSE FALSE ## DEWP FALSE FALSE ## RAIN FALSE FALSE ## WSPM FALSE FALSE ## 1 subsets of each size up to 5 ## Selection Algorithm: exhaustive ## TEMP PRES DEWP RAIN WSPM ## 1 ( 1 ) " " " " " " " " "*" ## 2 ( 1 ) "*" " " "*" " " " " ## 3 ( 1 ) "*" "*" "*" " " " " ## 4 ( 1 ) "*" "*" "*" "*" " " ## 5 ( 1 ) "*" "*" "*" "*" "*" ``` --- ### Back to all subset selection ```r summary(m_all)$which ``` ``` ## (Intercept) TEMP PRES DEWP RAIN WSPM ## 1 TRUE FALSE FALSE FALSE FALSE TRUE ## 2 TRUE TRUE FALSE TRUE FALSE FALSE ## 3 TRUE TRUE TRUE TRUE FALSE FALSE ## 4 TRUE TRUE TRUE TRUE TRUE FALSE ## 5 TRUE TRUE TRUE TRUE TRUE TRUE ``` --- ### Back to all subset selection ```r summary(m_all)$rsq ``` ``` ## [1] 0.07768534 0.17531794 0.18249650 0.18476412 0.18655102 ``` --- ### Back to all subset selection ```r summary(m_all)$adjr2 ``` ``` ## [1] 0.07765831 0.17526959 0.18242461 0.18466853 0.18643179 ``` --- ### 
Back to all subset selection ```r summary(m_all)$cp ``` ``` ## [1] 4563.41840 471.07339 172.03160 78.93604 6.00000 ``` --- ### Back to all subset selection ```r summary(m_all)$bic ``` ``` ## [1] -2738.289 -6545.375 -6833.231 -6917.566 -6981.995 ``` A quick note: in many software packages, AIC and BIC are reported as -2 times the log-likelihood plus some penalty. Since we want to maximize the log-likelihood (and thus *minimize* -2 times it), we want the *smallest* value in the above output. Here, that again corresponds to the model with all five predictors. --- ### Forward selection and backward elimination These are **stepwise** methods that add or remove variables from a model sequentially based on some criterion. **Forward** selection: 1. Start with an intercept-only model 2. Consider all models which additionally have one more predictor 3. Choose the model that is the "best" according to some criterion (we've now seen several) 4. If the model improves, then choose that one, and cycle back to step 2. Otherwise, if no improvement is seen, stop at the last model chosen. --- ### Forward selection and backward elimination **Backward** elimination: 1. Start with a "full" model 2. Consider all models which have one predictor taken away 3. Choose the model that is the "best" according to some criterion (we've now seen several) 4. If the model improves, then choose that one, and cycle back to step 2. Otherwise, if no improvement is seen, stop at the last model chosen. 
(we can also do both directions at once, adding or subtracting variables as needed sequentially) --- ### Stepwise methods ```r m_none <- lm(PM2.5 ~ 1, data = beijing) m_all <- lm(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) ``` --- ### Stepwise methods ```r library(MASS) stepAIC(m_none, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "forward") ``` ``` ## Start: AIC=300832.2 ## PM2.5 ~ 1 ## ## Df Sum of Sq RSS AIC ## + WSPM 1 17887405 212367166 298075 ## + TEMP 1 3856377 226398194 300258 ## + DEWP 1 3321831 226932740 300338 ## + RAIN 1 45155 230209416 300827 ## <none> 230254571 300832 ## + PRES 1 7552 230247019 300833 ## ## Step: AIC=298075 ## PM2.5 ~ WSPM ## ## Df Sum of Sq RSS AIC ## + TEMP 1 3271904 209095262 297547 ## + DEWP 1 305286 212061880 298028 ## + PRES 1 49916 212317249 298069 ## + RAIN 1 13434 212353731 298075 ## <none> 212367166 298075 ## ## Step: AIC=297547.2 ## PM2.5 ~ WSPM + TEMP ## ## Df Sum of Sq RSS AIC ## + DEWP 1 19724119 189371143 294169 ## + PRES 1 5308810 203786452 296672 ## <none> 209095262 297547 ## + RAIN 1 2695 209092567 297549 ## ## Step: AIC=294168.7 ## PM2.5 ~ WSPM + TEMP + DEWP ## ## Df Sum of Sq RSS AIC ## + PRES 1 1635802 187735340 293875 ## + RAIN 1 386391 188984752 294101 ## <none> 189371143 294169 ## ## Step: AIC=293874.7 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES ## ## Df Sum of Sq RSS AIC ## + RAIN 1 434995 187300346 293798 ## <none> 187735340 293875 ## ## Step: AIC=293797.5 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = beijing) ## ## Coefficients: ## (Intercept) WSPM TEMP DEWP PRES RAIN ## 1426.074 -3.611 -5.449 3.641 -1.259 -3.918 ``` --- ### Stepwise methods ```r stepAIC(m_all, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "backward") ``` ``` ## Start: AIC=293797.5 ## PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM ## ## Df Sum of Sq RSS AIC ## <none> 187300346 293798 ## - WSPM 1 411443 
187711788 293870 ## - RAIN 1 434995 187735340 293875 ## - PRES 1 1684407 188984752 294101 ## - DEWP 1 16450903 203751248 296668 ## - TEMP 1 23549408 210849754 297836 ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) ## ## Coefficients: ## (Intercept) TEMP PRES DEWP RAIN WSPM ## 1426.074 -5.449 -1.259 3.641 -3.918 -3.611 ``` --- ### Stepwise methods ```r stepAIC(m_none, scope = list(lower = m_none, upper = m_all), data = beijing, direction = "both") ``` ``` ## Start: AIC=300832.2 ## PM2.5 ~ 1 ## ## Df Sum of Sq RSS AIC ## + WSPM 1 17887405 212367166 298075 ## + TEMP 1 3856377 226398194 300258 ## + DEWP 1 3321831 226932740 300338 ## + RAIN 1 45155 230209416 300827 ## <none> 230254571 300832 ## + PRES 1 7552 230247019 300833 ## ## Step: AIC=298075 ## PM2.5 ~ WSPM ## ## Df Sum of Sq RSS AIC ## + TEMP 1 3271904 209095262 297547 ## + DEWP 1 305286 212061880 298028 ## + PRES 1 49916 212317249 298069 ## + RAIN 1 13434 212353731 298075 ## <none> 212367166 298075 ## - WSPM 1 17887405 230254571 300832 ## ## Step: AIC=297547.2 ## PM2.5 ~ WSPM + TEMP ## ## Df Sum of Sq RSS AIC ## + DEWP 1 19724119 189371143 294169 ## + PRES 1 5308810 203786452 296672 ## <none> 209095262 297547 ## + RAIN 1 2695 209092567 297549 ## - TEMP 1 3271904 212367166 298075 ## - WSPM 1 17302932 226398194 300258 ## ## Step: AIC=294168.7 ## PM2.5 ~ WSPM + TEMP + DEWP ## ## Df Sum of Sq RSS AIC ## + PRES 1 1635802 187735340 293875 ## + RAIN 1 386391 188984752 294101 ## <none> 189371143 294169 ## - WSPM 1 515672 189886814 294259 ## - DEWP 1 19724119 209095262 297547 ## - TEMP 1 22690737 212061880 298028 ## ## Step: AIC=293874.7 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES ## ## Df Sum of Sq RSS AIC ## + RAIN 1 434995 187300346 293798 ## <none> 187735340 293875 ## - WSPM 1 498578 188233919 293963 ## - PRES 1 1635802 189371143 294169 ## - DEWP 1 16051112 203786452 296672 ## - TEMP 1 23149487 210884828 297840 ## ## Step: AIC=293797.5 ## PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN 
## ## Df Sum of Sq RSS AIC ## <none> 187300346 293798 ## - WSPM 1 411443 187711788 293870 ## - RAIN 1 434995 187735340 293875 ## - PRES 1 1684407 188984752 294101 ## - DEWP 1 16450903 203751248 296668 ## - TEMP 1 23549408 210849754 297836 ``` ``` ## ## Call: ## lm(formula = PM2.5 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = beijing) ## ## Coefficients: ## (Intercept) WSPM TEMP DEWP PRES RAIN ## 1426.074 -3.611 -5.449 3.641 -1.259 -3.918 ``` --- ### Stepwise methods Potential issues with stepwise methods: - Algorithms are greedy - no guarantee the "best" model will be found - Often doesn't work well with highly correlated variables - Order often matters and results in different final models chosen - "Only uses the data" without thought for existing scientific knowledge --- ### LASSO **LASSO**, or least absolute shrinkage and **selection** operator, is a regression technique that also happens to perform variable selection. Instead of minimizing the residual sum of squares like in OLS: `$$\min_\boldsymbol\beta \sum_{i = 1}^n (y_i - \mathbf{x}^T_i\boldsymbol\beta)^2$$` LASSO minimizes the following: `$$\min_\boldsymbol\beta \left(\sum_{i = 1}^n (y_i - \mathbf{x}^T_i\boldsymbol\beta)^2 + \lambda \sum_{k = 1}^p |\beta_k| \right)$$` Because of this additional "penalty," some of the `\(\beta_k\)` terms can be set to zero, thus performing variable selection. Often `\(\lambda\)` is chosen to optimize some criterion (e.g., predictive performance, perhaps using cross-validation). --- ### LASSO ```r y <- beijing$PM2.5 x <- model.matrix(PM2.5 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = beijing) library(glmnet) # (has LASSO-related functions here) m_lasso_cv <- cv.glmnet(x, y, alpha = 1) ``` In the above code, we're performing k-fold cross-validation to choose the "best" `\(\lambda\)` value (in this case, by minimizing MSE in the test sets from the cross-validation). The argument `alpha = 1` corresponds to LASSO. 
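---

### LASSO: choosing lambda

One detail worth knowing (a sketch, using simulated data rather than the `beijing` data, so it runs on its own): `cv.glmnet` also records `lambda.1se`, the largest `\(\lambda\)` whose cross-validated error is within one standard error of the minimum - a common, more conservative choice than `lambda.min`:

```r
library(glmnet)

# Simulate data where only two of five predictors actually matter
set.seed(210)
x_sim <- matrix(rnorm(500 * 5), ncol = 5)
y_sim <- drop(x_sim %*% c(2, 0, -1, 0, 0)) + rnorm(500)

cv_fit <- cv.glmnet(x_sim, y_sim, alpha = 1)
cv_fit$lambda.min               # lambda minimizing cross-validated MSE
cv_fit$lambda.1se               # largest lambda within 1 SE of that minimum
coef(cv_fit, s = "lambda.1se")  # typically a sparser set of coefficients
```

Because `lambda.1se` applies a heavier penalty than `lambda.min`, it tends to zero out more coefficients.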
--- ### LASSO ```r best_lambda <- m_lasso_cv$lambda.min best_lambda ``` ``` ## [1] 0.04932974 ``` --- ### LASSO ```r plot(m_lasso_cv) ``` --- ### LASSO ```r m_best <- glmnet(x, y, alpha = 1, lambda = best_lambda) m_best$beta ``` ``` ## 6 x 1 sparse Matrix of class "dgCMatrix" ## s0 ## (Intercept) . ## TEMP -5.407856 ## PRES -1.241963 ## DEWP 3.617576 ## RAIN -3.840631 ## WSPM -3.679301 ``` If LASSO had **shrunk** any coefficient estimates to zero, they would show up as `.` in the sparse matrix above. In this case, LASSO chose to keep all variables, much like the other examples we've seen. Note also that no intercept estimate appears here - `glmnet` handles the intercept separately (it is stored in `m_best$a0`, not in `beta`). --- ### Categorical predictors The **group LASSO** variant keeps or drops a categorical variable "as a group" - an all-or-nothing sort of deal; standard LASSO as implemented in `glmnet` penalizes each dummy variable separately. The stepwise functions as implemented in R treat categorical predictors as a group. However, the best subset selection function shown today treats each of the dummy variables separately. --- ### A last word **Use your domain knowledge**. Don't blindly use these sorts of "automatic" variable selection techniques (unless you specifically are interested in prediction and *only* prediction). - Look at your data! - Don't forget about interpretations (especially if transformations are involved) - Write down what your model actually represents - **Use your domain knowledge** (I can't stress this enough) **Use your domain knowledge.**