Load packages:
library(ggplot2)
library(dplyr)
library(forcats)
We will simplify the data slightly:
set.seed(100)
dmdsimple <- diamonds %>%
  sample_n(1000) %>%
  mutate(
    color = fct_collapse(color,
                         colorless      = c("D", "E", "F"),
                         near_colorless = c("G", "H", "I", "J")) %>%
      factor(ordered = FALSE),
    clarity = fct_collapse(clarity,
                           SI  = c("SI1", "SI2"),
                           VS  = c("VS1", "VS2"),
                           VVS = c("VVS1", "VVS2")) %>%
      factor(ordered = FALSE)
  )
- \(R^2\) is the proportion of variability in the response variable explained by the regression model.
summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price) ~ carat + table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price) ~ carat * table, data = dmdsimple) )$r.squared
## [1] 0.8883356
- The model with the interaction has the highest \(R^2\), but it appears to be only marginally better.
summary( lm(sqrt(price)~carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$r.squared
## [1] 0.8857727
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$r.squared
## [1] 0.8921077
\(R^2\) is not a good criterion for model selection since it always increases (or at worst stays the same) when a variable is added to the model.
\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \]
where
\[ \begin{aligned} SS_{\text{Model}} &= \sum_{i=1}^n (\hat{y}_i-\bar{y})^2 \\ SS_{\text{Error}} &= \sum_{i=1}^n (y_i-\hat{y}_i)^2 \\ SS_{\text{Total}} &= \sum_{i=1}^n (y_i-\bar{y})^2 \end{aligned} \]
\[SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\]
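This decomposition is easy to verify numerically. A minimal sketch, using the full `diamonds` data rather than the `dmdsimple` subset so it is self-contained (the exact numbers therefore differ from the slides):

```r
# Verify SS_Total = SS_Model + SS_Error for an OLS fit with an intercept
library(ggplot2)  # provides the diamonds data

fit  <- lm(sqrt(price) ~ carat, data = diamonds)
y    <- sqrt(diamonds$price)
yhat <- fitted(fit)

ss_model <- sum((yhat - mean(y))^2)
ss_error <- sum((y - yhat)^2)
ss_total <- sum((y - mean(y))^2)

all.equal(ss_total, ss_model + ss_error)  # TRUE
```

Note the identity relies on the model including an intercept; without one, the cross term in the decomposition does not vanish.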
anova( lm(sqrt(price)~carat+table, data = dmdsimple) )
## Analysis of Variance Table
##
## Response: sqrt(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 729360 729360 7669.9154 < 2.2e-16 ***
## table 1 743 743 7.8098 0.005296 **
## Residuals 997 94808 95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ SS_{\text{Model}} = SS_{\text{carat}} + SS_{\text{table}} = 729360 + 743 = 730103 \]
\[ SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$r.squared
## [1] 0.8850684
\[ SS_{\text{Model}} = 730103 \qquad SS_{\text{Error}} = 94808 \qquad SS_{\text{Total}} = 824911 \]
\[ R^2 = 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}} = 1 - \frac{94808}{824911} = 0.885 \]
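The same arithmetic can be done programmatically, pulling the sums of squares straight out of the `anova()` table. A sketch on the full `diamonds` data (so the numbers differ from `dmdsimple`):

```r
# Recover R^2 from the anova() sums of squares
library(ggplot2)  # provides the diamonds data

fit <- lm(sqrt(price) ~ carat + table, data = diamonds)
ss  <- anova(fit)$"Sum Sq"   # per-term SS, with the residual SS last

ss_error <- tail(ss, 1)
ss_total <- sum(ss)          # SS_Total = sum of all rows (model terms + residual)

r2 <- 1 - ss_error / ss_total
all.equal(r2, summary(fit)$r.squared)  # TRUE
```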
All \(SS \ge 0\) since they are sums of squared values.
Since \(SS_{\text{Total}} = SS_{\text{Model}} + SS_{\text{Error}}\), increasing \(SS_{\text{Model}}\) necessarily decreases \(SS_{\text{Error}}\), and vice versa.
Adding a variable increases \(SS_{\text{Model}}\) by \(SS_{\text{new var}}\), which decreases \(SS_{\text{Error}}\) by the same amount and therefore increases \(R^2\) (or leaves it unchanged).
Consequently, \(R^2\) will always prefer the more complicated model.
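To see this concretely, a sketch that adds a pure-noise predictor (the `noise` column is a hypothetical variable created only for this demonstration) and checks that \(R^2\) does not go down:

```r
# Adding a predictor that is unrelated to the response cannot lower R^2
library(ggplot2)  # provides the diamonds data
set.seed(1)

d <- diamonds
d$noise <- rnorm(nrow(d))   # pure noise, unrelated to price by construction

r2_small <- summary(lm(sqrt(price) ~ carat,         data = d))$r.squared
r2_big   <- summary(lm(sqrt(price) ~ carat + noise, data = d))$r.squared

r2_big >= r2_small  # TRUE
```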
\[ R^2_{adj} = 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \] where \(n\) is the number of cases and \(k\) is the number of slope coefficients in the model
- Adjusted \(R^2\) is like \(R^2\), with a penalty for each variable added to the model
- This makes adjusted \(R^2\) a more reasonable metric for model selection than \(R^2\)
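As a check on the formula, a minimal sketch computing adjusted \(R^2\) by hand and comparing it to what `summary()` reports (again on the full `diamonds` data, so the numbers differ from `dmdsimple`):

```r
# Adjusted R^2 by hand, matching summary()$adj.r.squared
library(ggplot2)  # provides the diamonds data

fit <- lm(sqrt(price) ~ carat + table, data = diamonds)
ss  <- anova(fit)$"Sum Sq"
ss_error <- tail(ss, 1)
ss_total <- sum(ss)

n <- nrow(diamonds)
k <- length(coef(fit)) - 1   # slope coefficients (excludes the intercept)

adj_r2 <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)
all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```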
summary( lm(sqrt(price)~carat, data = dmdsimple) )$adj.r.squared
## [1] 0.8840521
summary( lm(sqrt(price)~carat+table, data = dmdsimple) )$adj.r.squared
## [1] 0.8848379
summary( lm(sqrt(price)~carat+table+depth, data = dmdsimple) )$adj.r.squared
## [1] 0.8854287
summary( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )$adj.r.squared
## [1] 0.8916739
anova( lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) )
## Analysis of Variance Table
##
## Response: sqrt(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## carat 1 729360 729360 8153.9361 < 2.2e-16 ***
## table 1 743 743 8.3026 0.004044 **
## depth 1 581 581 6.4955 0.010964 *
## color 1 5226 5226 58.4216 4.968e-14 ***
## Residuals 995 89002 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ SS_{\text{Model}} = 729360 + 743 + 581 + 5226 = 735910 \]
\[ SS_{\text{Error}} = 89002 \qquad SS_{\text{Total}} = 824911 \]
lm(sqrt(price)~carat+table+depth+color, data = dmdsimple) %>% summary()
##
## Call:
## lm(formula = sqrt(price) ~ carat + table + depth + color, data = dmdsimple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.258 -4.364 -1.101 3.082 45.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6824 17.4457 3.937 8.83e-05 ***
## carat 58.4557 0.6560 89.115 < 2e-16 ***
## table -0.4772 0.1379 -3.462 0.000559 ***
## depth -0.4800 0.2142 -2.241 0.025238 *
## colornear_colorless -4.6640 0.6102 -7.643 4.97e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.458 on 995 degrees of freedom
## Multiple R-squared: 0.8921, Adjusted R-squared: 0.8917
## F-statistic: 2057 on 4 and 995 DF, p-value: < 2.2e-16
\[ \begin{aligned} \text{Adj }R^2 &= 1 - \frac{ SS_{\text{Error}} }{ SS_{\text{Total}} } \times \frac{n - 1}{n - k - 1} \\ &= 1 - \frac{89002}{824911} \times \frac{1000 - 1}{1000 - 4 - 1} \\ &= 0.8917 \end{aligned} \]
Adjusted \(R^2\) is just one possible model selection criterion.
There are many others; we'll discuss some later in the course, and you might encounter others in other courses.
If an interaction is included in the model, the main effects of both interacting variables should also be in the model
Backward selection:

- Start with the full model (including all candidate explanatory variables and all candidate interactions)
- Remove one variable at a time, and select the model with the highest adjusted \(R^2\)
- Continue until adjusted \(R^2\) does not increase
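The backward steps above can be sketched as a loop scored by adjusted \(R^2\) (main effects only, for brevity; note that base R's `step()` does something similar but scores by AIC). The `root_price` column is a helper added here so `reformulate()` can build the formulas:

```r
# Backward elimination scored by adjusted R^2 (main effects only)
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$root_price <- sqrt(d$price)   # precomputed response, for reformulate()

adj_r2 <- function(vars)
  summary(lm(reformulate(vars, "root_price"), data = d))$adj.r.squared

vars <- c("carat", "table", "depth", "color")   # full candidate set
repeat {
  current <- adj_r2(vars)
  # adjusted R^2 of each model with one variable removed
  drops <- sapply(vars, function(v) adj_r2(setdiff(vars, v)))
  if (max(drops) <= current || length(vars) == 1) break
  vars <- setdiff(vars, names(which.max(drops)))
}
vars  # variables kept
```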
Forward selection:

- Start with the empty (intercept-only) model
- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted \(R^2\)
- Continue until adjusted \(R^2\) does not increase
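The forward steps can be sketched the same way, starting from the intercept-only model (for which \(R^2\), and hence adjusted \(R^2\), is 0). As before, `root_price` is a helper column added for illustration:

```r
# Forward selection scored by adjusted R^2 (main effects only)
library(ggplot2)  # provides the diamonds data

d <- diamonds
d$root_price <- sqrt(d$price)

adj_r2 <- function(vars) {
  if (length(vars) == 0)   # empty model: intercept only
    return(summary(lm(root_price ~ 1, data = d))$adj.r.squared)
  summary(lm(reformulate(vars, "root_price"), data = d))$adj.r.squared
}

candidates <- c("carat", "table", "depth", "color")
chosen <- character(0)
repeat {
  current <- adj_r2(chosen)
  remaining <- setdiff(candidates, chosen)
  if (length(remaining) == 0) break
  # adjusted R^2 of each model with one more variable added
  adds <- sapply(remaining, function(v) adj_r2(c(chosen, v)))
  if (max(adds) <= current) break
  chosen <- c(chosen, names(which.max(adds)))
}
chosen  # variables selected, in the order they were added
```

Note that forward and backward selection need not arrive at the same model, which is one reason neither is a complete answer to model selection.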