Load packages:
library(ggplot2)
library(dplyr)
library(forcats)
We will simplify the data slightly:
set.seed(100)
dmdsimple <- diamonds %>%
  sample_n(1000) %>%
  mutate(
    color = fct_collapse(color,
                         colorless      = c("D", "E", "F"),
                         near_colorless = c("G", "H", "I", "J")) %>%
      factor(ordered = FALSE),
    clarity = fct_collapse(clarity,
                           SI  = c("SI1", "SI2"),
                           VS  = c("VS1", "VS2"),
                           VVS = c("VVS1", "VVS2")) %>%
      factor(ordered = FALSE)
  )
- \(R^2\) is the proportion of variability in the response variable explained by the regression model.
summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price) ~ carat + table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price) ~ carat * table, data = dmdsimple) )$r.squared
## [1] 0.8883356
- The model with the interaction has the highest \(R^2\), but it appears to be only marginally better.
summary( lm(sqrt(price)~carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$r.squared
## [1] 0.8857727
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$r.squared
## [1] 0.8921077
\(R^2\) is not a good criterion for model selection since it always increases (or at worst stays the same) when a variable is added to the model.
\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \]
where
\[ \begin{aligned} SS_{\text{Model}} &= \sum_{i=1}^n (\hat{y}_i-\bar{y})^2 \\ SS_{\text{Error}} &= \sum_{i=1}^n (y_i-\hat{y}_i)^2 \\ SS_{\text{Total}} &= \sum_{i=1}^n (y_i-\bar{y})^2 \end{aligned} \]
\[SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\]
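This decomposition is easy to verify numerically. A minimal sketch, using the full `diamonds` data rather than the `dmdsimple` subset so it is self-contained (the exact numbers therefore differ from the slides):

```r
# Verify SS_Total = SS_Model + SS_Error for an OLS fit with an intercept
library(ggplot2)  # provides the diamonds data

fit  <- lm(sqrt(price) ~ carat, data = diamonds)
y    <- sqrt(diamonds$price)
yhat <- fitted(fit)

ss_model <- sum((yhat - mean(y))^2)
ss_error <- sum((y - yhat)^2)
ss_total <- sum((y - mean(y))^2)

all.equal(ss_total, ss_model + ss_error)  # TRUE
```

Note the identity relies on the model including an intercept; without one, the cross term in the decomposition does not vanish.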
anova( lm(sqrt(price)~carat+table, data = dmdsimple) )
## Analysis of Variance Table
##
## Response: sqrt(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 729360 729360 7669.9154 < 2.2e-16 ***
## table 1 743 743 7.8098 0.005296 **
## Residuals 997 94808 95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ SS_{\text{Model}} = SS_{\text{carat}} + SS_{\text{table}} = 729360 + 743 = 730103 \]
\[ SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684
\[ SS_{\text{Model}} = 730103 \qquad SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]
\[ R^2 = 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}} = 1 - \frac{94808}{824911} = 0.885 \]
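The same arithmetic can be done programmatically, pulling the sums of squares straight out of the `anova()` table. A sketch on the full `diamonds` data (so the numbers differ from `dmdsimple`):

```r
# Recover R^2 from the anova() sums of squares
library(ggplot2)  # provides the diamonds data

fit <- lm(sqrt(price) ~ carat + table, data = diamonds)
ss  <- anova(fit)$"Sum Sq"   # per-term SS, with the residual SS last

ss_error <- tail(ss, 1)
ss_total <- sum(ss)          # SS_Total = sum of all rows (model terms + residual)

r2 <- 1 - ss_error / ss_total
all.equal(r2, summary(fit)$r.squared)  # TRUE
```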
All \(SS \ge 0\) since they are sums of squared values.
Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\), increasing \(SS_{\text{Model}}\) necessarily decreases \(SS_{\text{Error}}\), and vice versa.
Adding a variable increases \(SS_{\text{Model}}\) by \(SS_{\text{new var}}\), which decreases \(SS_{\text{Error}}\) by the same amount and therefore increases \(R^2\) (or leaves it unchanged).
Consequently, \(R^2\) will always prefer the more complicated model.
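To see this concretely, a sketch that adds a pure-noise predictor (the `noise` column is a hypothetical variable created only for this demonstration) and checks that \(R^2\) does not go down:

```r
# Adding a predictor that is unrelated to the response cannot lower R^2
library(ggplot2)  # provides the diamonds data
set.seed(1)

d <- diamonds
d$noise <- rnorm(nrow(d))   # pure noise, unrelated to price by construction

r2_small <- summary(lm(sqrt(price) ~ carat,         data = d))$r.squared
r2_big   <- summary(lm(sqrt(price) ~ carat + noise, data = d))$r.squared

r2_big >= r2_small  # TRUE
```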
\[ R^2_{adj} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \] where \(n\) is the number of cases and \(k\) is the number of slope coefficients in the model
- Adjusted \(R^2\) is like \(R^2\), with a penalty for each variable added to the model
- This makes adjusted \(R^2\) a more reasonable metric for model selection than \(R^2\)
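As a check on the formula, a minimal sketch computing adjusted \(R^2\) by hand and comparing it to what `summary()` reports (again on the full `diamonds` data, so the numbers differ from `dmdsimple`):

```r
# Adjusted R^2 by hand, matching summary()$adj.r.squared
library(ggplot2)  # provides the diamonds data

fit <- lm(sqrt(price) ~ carat + table, data = diamonds)
ss  <- anova(fit)$"Sum Sq"
ss_error <- tail(ss, 1)
ss_total <- sum(ss)

n <- nrow(diamonds)
k <- length(coef(fit)) - 1   # slope coefficients (excludes the intercept)

adj_r2 <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)
all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```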
summary( lm(sqrt(price)~carat, data = dmdsimple) )$adj.r.squared
## [1] 0.8840521
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$adj.r.squared
## [1] 0.8848379
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$adj.r.squared
## [1] 0.8854287
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$adj.r.squared
## [1] 0.8916739
anova( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )
## Analysis of Variance Table
##
## Response: sqrt(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 729360 729360 8153.9361 < 2.2e-16 ***
## table 1 743 743 8.3026 0.004044 **
## depth 1 581 581 6.4955 0.010964 *
## color 1 5226 5226 58.4216 4.968e-14 ***
## Residuals 995 89002 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ SS_{\text{Model}} = 729360 + 743 + 581 + 5226 = 735910 \]
\[ SS_{\text{Error}} = 89002 \qquad SS_{\text{Total}} = 824911 \]
lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) %>% summary()
##
## Call:
## lm(formula = sqrt(price) ~ carat + table + depth + color, data = dmdsimple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.258 -4.364 -1.101 3.082 45.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6824 17.4457 3.937 8.83e-05 ***
## carat 58.4557 0.6560 89.115 < 2e-16 ***
## table -0.4772 0.1379 -3.462 0.000559 ***
## depth -0.4800 0.2142 -2.241 0.025238 *
## colornear_colorless -4.6640 0.6102 -7.643 4.97e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.458 on 995 degrees of freedom
## Multiple R-squared: 0.8921, Adjusted R-squared: 0.8917
## F-statistic: 2057 on 4 and 995 DF, p-value: < 2.2e-16
\[ \begin{aligned} \text{Adj }R^2 &= 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \\ &= 1 - \frac{89002}{824911} \times \frac{1000 - 1}{1000 - 4 - 1} \\ &= 0.8917 \end{aligned} \]
Adjusted \(R^2\) is just one possible model selection criterion.
There are many others; we'll discuss some later in the course, and you might encounter others in other courses.
If an interaction is included in the model, the main effects of both interacting variables should also be in the model
Backward selection:

- Start with the full model (including all candidate explanatory variables and all candidate interactions)
- Remove one variable at a time, and select the model with the highest adjusted \(R^2\)
- Continue until adjusted \(R^2\) does not increase
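The backward steps above can be sketched as a loop scored by adjusted \(R^2\) (main effects only, for brevity; note that base R's `step()` does something similar but scores by AIC). The `root_price` column is a helper added here so `reformulate()` can build the formulas:

```r
# Backward elimination scored by adjusted R^2 (main effects only)
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$root_price <- sqrt(d$price)   # precomputed response, for reformulate()

adj_r2 <- function(vars)
  summary(lm(reformulate(vars, "root_price"), data = d))$adj.r.squared

vars <- c("carat", "table", "depth", "color")   # full candidate set
repeat {
  current <- adj_r2(vars)
  # adjusted R^2 of each model with one variable removed
  drops <- sapply(vars, function(v) adj_r2(setdiff(vars, v)))
  if (max(drops) <= current || length(vars) == 1) break
  vars <- setdiff(vars, names(which.max(drops)))
}
vars  # variables kept
```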
Forward selection:

- Start with the empty (intercept-only) model
- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted \(R^2\)
- Continue until adjusted \(R^2\) does not increase
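The forward steps can be sketched the same way, starting from the intercept-only model (for which \(R^2\), and hence adjusted \(R^2\), is 0). As before, `root_price` is a helper column added for illustration:

```r
# Forward selection scored by adjusted R^2 (main effects only)
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$root_price <- sqrt(d$price)

adj_r2 <- function(vars) {
  if (length(vars) == 0)   # empty model: intercept only
    return(summary(lm(root_price ~ 1, data = d))$adj.r.squared)
  summary(lm(reformulate(vars, "root_price"), data = d))$adj.r.squared
}

candidates <- c("carat", "table", "depth", "color")
chosen <- character(0)
repeat {
  current <- adj_r2(chosen)
  remaining <- setdiff(candidates, chosen)
  if (length(remaining) == 0) break
  # adjusted R^2 of each model with one more variable added
  adds <- sapply(remaining, function(v) adj_r2(c(chosen, v)))
  if (max(adds) <= current) break
  chosen <- c(chosen, names(which.max(adds)))
}
chosen  # variables selected, in the order they were added
```

Note that forward and backward selection need not arrive at the same model, which is one reason neither is a complete answer to model selection.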