Model selection


Setup

Back to the diamonds (simplified)

Load packages:

library(ggplot2)
library(dplyr)
library(forcats)

We will simplify the data slightly:

set.seed(100)
dmdsimple = diamonds %>%
    sample_n(1000) %>%
    mutate(color = fct_collapse(color, 
                                colorless  = c("D","E","F"),
                                near_colorless = c("G","H","I","J")) %>%
                   factor(ordered=FALSE),
           clarity = fct_collapse(clarity, 
                                  SI = c("SI1","SI2"),
                                  VS = c("VS1","VS2"),
                                  VVS1 = c("VVS1","VVS2")) %>%
                     factor(ordered=FALSE))

Quality of fit in MLR

\(R^2\)

  • \(R^2\) is the percentage of variability in the response variable explained by the regression model.
summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price) ~ carat + table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price) ~ carat * table, data = dmdsimple) )$r.squared
## [1] 0.8883356
  • The model with the interaction has the highest \(R^2\), but it appears to be only marginally better

Notice a pattern?

summary( lm(sqrt(price)~carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$r.squared
## [1] 0.8857727
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$r.squared
## [1] 0.8921077

\(R^2 \ne\) Parsimonious models

\(R^2\) is not a good criterion for model selection since it always increases (or at worst stays the same) when a variable is added to the model.

\(R^2\) from first principles

\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \]

where

\[ \begin{aligned} SS_{\text{Model}} &= \sum_{i=1}^n (\hat{y}_i-\bar{y})^2 \\ SS_{\text{Error}} &= \sum_{i=1}^n (y_i-\hat{y}_i)^2 \\ SS_{\text{Total}} &= \sum_{i=1}^n (y_i-\bar{y})^2 \end{aligned} \]

\[SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\]
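This decomposition is easy to verify directly in R. A minimal sketch using the built-in mtcars data (any data set, including dmdsimple, works the same way):

```r
# Verify SS_Total = SS_Model + SS_Error for a simple linear model
fit  <- lm(mpg ~ wt, data = mtcars)
yhat <- fitted(fit)
ybar <- mean(mtcars$mpg)

ss_model <- sum((yhat - ybar)^2)
ss_error <- sum((mtcars$mpg - yhat)^2)
ss_total <- sum((mtcars$mpg - ybar)^2)

all.equal(ss_total, ss_model + ss_error)
## [1] TRUE
```

Here `ss_model / ss_total` also reproduces `summary(fit)$r.squared`, matching the formula above.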

ANOVA (Sum of Squares) Table

anova( lm(sqrt(price)~carat+table, data = dmdsimple) )
## Analysis of Variance Table
## 
## Response: sqrt(price)
##            Df Sum Sq Mean Sq   F value    Pr(>F)    
## carat       1 729360  729360 7669.9154 < 2.2e-16 ***
## table       1    743     743    7.8098  0.005296 ** 
## Residuals 997  94808      95                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[ SS_{\text{Model}} = SS_{\text{carat}} + SS_{\text{table}} = 729360 + 743 = 730103 \]

\[ SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]

Sum of Squares and \(R^2\)

summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684

\[ SS_{\text{Model}} = 730103 \qquad SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]


\[ R^2 = 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}} = 1 - \frac{94808}{824911} = 0.885 \]

Summary

  • All \(SS \ge 0\) since they are sums of squared values

  • Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\), increasing \(SS_{\text{Model}}\) necessarily decreases \(SS_{\text{Error}}\), and vice versa.

  • Adding a variable increases \(SS_{\text{Model}}\) by \(SS_{\text{new var}}\), which decreases \(SS_{\text{Error}}\) by the same amount and therefore increases \(R^2\).

  • \(R^2\) will always prefer a more complicated model

Adjusted \(R^2\)

Adjusted \(R^2\)

\[ R^2_{adj} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \] where \(n\) is the number of cases and \(k\) is the number of slope coefficients in the model

  • Adjusted \(R^2\) is like \(R^2\) but adds a penalty for each additional variable
  • This makes adjusted \(R^2\) a more reasonable metric for model selection
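The penalty is easiest to see by adding a predictor that is pure noise. A sketch using the built-in mtcars data (the `noise` column is made up for illustration):

```r
set.seed(1)
dat <- mtcars
dat$noise <- rnorm(nrow(dat))   # a predictor with no relationship to mpg

base <- summary(lm(mpg ~ wt, data = dat))
plus <- summary(lm(mpg ~ wt + noise, data = dat))

base$r.squared < plus$r.squared          # always TRUE: R^2 only goes up
base$adj.r.squared > plus$adj.r.squared  # typically TRUE for a junk variable
```

\(R^2\) rewards the junk variable regardless, while adjusted \(R^2\) usually drops because the tiny gain in \(SS_{\text{Model}}\) does not offset the lost degree of freedom.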

Adjusted \(R^2\) Example

summary( lm(sqrt(price)~carat, data = dmdsimple) )$adj.r.squared
## [1] 0.8840521
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$adj.r.squared
## [1] 0.8848379
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$adj.r.squared
## [1] 0.8854287
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$adj.r.squared
## [1] 0.8916739

ANOVA (Sum of Squares) Table (again)

anova( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )
## Analysis of Variance Table
## 
## Response: sqrt(price)
##            Df Sum Sq Mean Sq   F value    Pr(>F)    
## carat       1 729360  729360 8153.9361 < 2.2e-16 ***
## table       1    743     743    8.3026  0.004044 ** 
## depth       1    581     581    6.4955  0.010964 *  
## color       1   5226    5226   58.4216 4.968e-14 ***
## Residuals 995  89002      89                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[ SS_{\text{Model}} = 729360 + 743 + 581 + 5226 = 735910 \]

\[ SS_{\text{Error}} = 89002 \qquad SS_{\text{Total}} = 824911 \]

Summary

lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) %>% summary()
## 
## Call:
## lm(formula = sqrt(price) ~ carat + table + depth + color, data = dmdsimple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.258  -4.364  -1.101   3.082  45.000 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          68.6824    17.4457   3.937 8.83e-05 ***
## carat                58.4557     0.6560  89.115  < 2e-16 ***
## table                -0.4772     0.1379  -3.462 0.000559 ***
## depth                -0.4800     0.2142  -2.241 0.025238 *  
## colornear_colorless  -4.6640     0.6102  -7.643 4.97e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.458 on 995 degrees of freedom
## Multiple R-squared:  0.8921, Adjusted R-squared:  0.8917 
## F-statistic:  2057 on 4 and 995 DF,  p-value: < 2.2e-16

Calculating Adjusted \(R^2\)

\[ \begin{aligned} \text{Adj }R^2 &= 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \\ &= 1 - \frac{89002}{824911} \times \frac{1000 - 1}{1000 - 4 - 1} \\ &= 0.8917 \end{aligned} \]
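The same arithmetic can be reproduced directly in R (the sums of squares are copied from the ANOVA table above):

```r
n <- 1000   # number of cases
k <- 4      # number of slope coefficients
ss_error <- 89002
ss_total <- 824911

round(1 - (ss_error / ss_total) * (n - 1) / (n - k - 1), 4)
## [1] 0.8917
```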

Stepwise model selection

Model selection criteria

  • Adjusted \(R^2\) is just one possible model selection criterion

  • There are others out there (many, many others!); we’ll discuss some later in the course, and you might see others in other courses

  • If an interaction is included in the model, the main effects of both interacting variables should also be in the model

Backwards elimination

  • Start with full model (including all candidate explanatory variables and all candidate interactions)
  • Remove one variable at a time, and select the model with the highest adjusted \(R^2\)
  • Continue until adjusted \(R^2\) does not increase
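These steps can be sketched as a loop in R. This is an illustration on the built-in mtcars data with main effects only, not a general-purpose implementation (`adj_r2` and the variable list are made up for this example):

```r
# Backward elimination by adjusted R^2, illustrated on mtcars
adj_r2 <- function(form, data) summary(lm(form, data = data))$adj.r.squared

vars <- c("wt", "hp", "disp", "qsec")
best <- reformulate(vars, response = "mpg")   # start with the full model

repeat {
  if (length(vars) == 1) break
  current <- adj_r2(best, mtcars)
  # Fit every model with one variable removed
  candidates <- lapply(seq_along(vars), function(i)
    reformulate(vars[-i], response = "mpg"))
  scores <- vapply(candidates, adj_r2, numeric(1), data = mtcars)
  if (max(scores) <= current) break           # no improvement: stop
  keep <- which.max(scores)
  vars <- vars[-keep]
  best <- candidates[[keep]]
}
best
```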

Forward selection

  • Start with empty model
  • Add one variable (or interaction effect) at a time, and select the model with the highest adjusted \(R^2\)
  • Continue until adjusted \(R^2\) does not increase
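Base R’s `step()` automates this kind of search, though note that it uses AIC as its criterion rather than adjusted \(R^2\), so its answers can differ from a hand search. A sketch of forward selection on the built-in mtcars data:

```r
# Forward selection with stats::step(); the scope formula lists the
# candidate variables that may be added
empty <- lm(mpg ~ 1, data = mtcars)
fwd <- step(empty, scope = ~ wt + hp + disp + qsec,
            direction = "forward", trace = FALSE)
fwd
```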

Demo