Model selection 👌

# Model selection <br> 👌

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---

## Announcements

- Team meetings: Find a time when all team members can come meet with me for 20 minutes, ideally during my office hours, or suggest a different time on Slack.
- Office hours:
  - Thursday (today): 2:30 - 4:30pm
  - Monday: 11:30 - 1:30pm (no Thursday OH next week)
- Changed due date for project proposal
  - Check out the Data and Visualization Library services for help with finding data for your projects: [library.duke.edu/data/](https://library.duke.edu/data/)
  - You will be in new teams for your projects
- New team HW due Tuesday

---

# Quality of fit in MLR

---

## From last time...

Load data:

```r
pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA"))
```

--
Filter for paintings < 5000 square inches:

```r
pp_Surf_lt_5000 <- pp %>% filter(Surface < 5000)
```

--
Model with main effects only:

```r
m_main <- lm(log(price) ~ Surface + factor(artistliving), 
             data = pp_Surf_lt_5000)
```

--
Model with interaction effects:

```r
m_int <- lm(log(price) ~ Surface + factor(artistliving) + 
              Surface * factor(artistliving), 
            data = pp_Surf_lt_5000)
```

---

## R-squared

- `$R^2$` is the percentage of variability in the response variable explained by the 
regression model.

```r
glance(m_main)$r.squared
```

```
## [1] 0.01320884
```

```r
glance(m_int)$r.squared
```

```
## [1] 0.0176922
```

--
- Clearly the model with interactions has a higher `$R^2$`.

--
- However using `$R^2$` for model selection in models with multiple explanatory 
variables is not a good idea as `$R^2$` increases when **any** variable is added 
to the model.

---

## R-squared - first principles

$$ R^2 =  \frac{ SS\_{Reg} }{ SS\_{Total} } = 1 - \left( \frac{ SS\_{Error} }{ SS\_{Total} } \right) $$

```r
anova(m_main)
```

```
## Analysis of Variance Table
## 
## Response: log(price)
##                        Df  Sum Sq Mean Sq F value    Pr(>F)
## Surface                 1   138.5 138.537 40.6741 2.058e-10
## factor(artistliving)    1     6.8   6.810  1.9994    0.1575
## Residuals            3188 10858.4   3.406
```

---

## Adjusted R-squared

$$ R^2\_{adj} = 1 - \left( \frac{ SS\_{Error} }{ SS\_{Total} } \times \frac{n - 1}{n - k - 1} \right), $$

where `$n$` is the number of cases and `$k$` is the number of predictors in the model

--
- Adjusted `$R^2$` doesn't increase if the new variable does not provide any new 
informaton or is completely unrelated.

--
- This makes adjusted `$R^2$` a preferable metric for model selection in multiple
regression models.

---

## In pursuit of Occam's Razor

- Occam's Razor states that among competing hypotheses that predict equally well, 
the one with the fewest assumptions should be selected.
- Model selection follows this principle.
- We only want to add another variable to the model if the addition of that
variable brings something valuable in terms of predictive power to the model.
- In other words, we prefer the simplest best model, i.e. **parsimonious** model.

---

## Comparing models

It appears that adding the interaction actually increased adjusted `$R^2$`, so we 
should indeed use the model with the interactions.

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.01258977
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.01676753
```

---

# Model selection

---

## Backwards elimination

- Start with **full** model (including all candidate explanatory variables and 
all candidate interactions)
- Remove one variable at a time, and select the model with the highest adjusted 
`$R^2$`
- Continue until adjusted `$R^2$` does not increase

---

## Forward selection

- Start with **empty** model
- Add one variable (or interaction effect) at a time, and select the model 
with the highest adjusted `$R^2$`
- Continue until adjusted `$R^2$` does not increase

---

## Model selection and interaction effects

Model selection for models including interaction effects must follow the 
following two principles:

- If an interaction is included in the model, the main effects of both of 
those variables must also be in the model.
- If a main effect is not in the model, then its interaction should not be 
in the model.

---

## Other model selection criteria

- Adjusted `$R^2$` is one model selection criterion
- There are others out there (many many others!), we'll discuss some later in 
the course, and some you might see in other courses

---

## Exploration 2

### Price, surface area, material, shape

```r
m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp)
tidy(m_sms) %>%
  print(n = 25)
```

```
## # A tibble: 25 x 5
##    term           estimate std.error statistic  p.value
##    <chr>             <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)    3.97     2.23         1.78   7.58e- 2
##  2 Surface        0.000349 0.0000379    9.21   5.76e-20
##  3 matal          0.860    2.23         0.385  7.00e- 1
##  4 matar         -1.27     2.23        -0.567  5.71e- 1
##  5 matb           1.59     1.30         1.22   2.21e- 1
##  6 matbr          0.982    1.48         0.664  5.07e- 1
##  7 matc           1.81     1.30         1.39   1.66e- 1
##  8 matca         -0.538    1.83        -0.295  7.68e- 1
##  9 matco         -0.342    1.50        -0.228  8.20e- 1
## 10 mate           0.996    2.23         0.446  6.56e- 1
## 11 matg           0.603    1.67         0.361  7.18e- 1
## 12 math          -1.04     1.58        -0.657  5.11e- 1
## 13 matm           3.24     2.26         1.43   1.52e- 1
## 14 matmi          3.44     2.25         1.53   1.26e- 1
## 15 mato           2.43     1.58         1.54   1.25e- 1
## 16 matp           0.782    1.34         0.583  5.60e- 1
## 17 matpa          2.95     1.50         1.97   4.87e- 2
## 18 matt           0.885    1.30         0.681  4.96e- 1
## 19 matta          0.521    1.31         0.396  6.92e- 1
## 20 matv           0.531    1.41         0.376  7.07e- 1
## 21 Shapeoval     -0.798    1.85        -0.430  6.67e- 1
## 22 Shapeovale    -0.272    1.86        -0.146  8.84e- 1
## 23 Shaperonde    -0.170    1.99        -0.0856 9.32e- 1
## 24 Shaperound    -1.70     1.83        -0.929  3.53e- 1
## 25 Shapesqu_rect -0.226    1.82        -0.124  9.01e- 1
```
]

---

## Shape and material

Collapse levels of `Shape` and `mat`erial variables with `forcats::fct_collapse`:

```r
pp <- pp %>%
  mutate(
    Shape = fct_collapse(Shape, oval = c("oval", "ovale"),
                                round = c("round", "ronde"),
                                squ_rect = "squ_rect",
                                other = c("octogon", "octagon", "miniature")),
    mat = fct_collapse(mat, metal = c("a", "br", "c"),
                            canvas = c("co", "t", "ta"),
                            paper = c("p", "ca"),
                            wood = "b",
                            other = c("e", "g", "h", "mi", "o", "pa", "v", "al", "ar", "m"))
  )
```
]

---

## Review fixes

```r
pp %>%
  count(Shape)
```

```
## # A tibble: 5 x 2
##   Shape        n
##   <fct>    <int>
## 1 other       12
## 2 oval        52
## 3 round       74
## 4 squ_rect  3219
## 5 <NA>        36
```
]

```r
pp %>%
  count(mat)
```

```
## # A tibble: 6 x 2
##   mat        n
##   <fct>  <int>
## 1 metal    321
## 2 other     59
## 3 wood     886
## 4 paper     38
## 5 canvas  1783
## 6 <NA>     306
```
]

---

## Refit model

.question[
Interpret the slope for `matother`. *Hint:* What is the baseline level for the 
material variable?
]

```r
m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp)
tidy(m_sms)
```

```
## # A tibble: 9 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    5.75     1.83          3.15  1.66e- 3
## 2 Surface        0.000344 0.0000377     9.14  1.17e-19
## 3 matother      -0.643    0.333        -1.93  5.38e- 2
## 4 matwood       -0.189    0.120        -1.58  1.14e- 1
## 5 matpaper      -1.10     0.343        -3.21  1.32e- 3
## 6 matcanvas     -0.916    0.115        -7.95  2.52e-15
## 7 Shapeoval     -0.519    1.84         -0.282 7.78e- 1
## 8 Shaperound    -1.62     1.84         -0.884 3.77e- 1
## 9 Shapesqu_rect -0.223    1.82         -0.123 9.02e- 1
```

---

## Change the baseline level

Change order of levels of the `mat` variable with `forcats::fct_relevel`:

```r
pp <- pp %>%
  mutate(mat = fct_relevel(mat, "other"))
levels(pp$mat)
```

```
## [1] "other"  "metal"  "wood"   "paper"  "canvas"
```

---

## Refit model

.question[
Interpret the slope for `matmetal`. *Hint:* What is the baseline level for the 
material variable?
]

```r
m_sms <- lm(log(price) ~ Surface + mat + Shape, data = pp)
tidy(m_sms)
```

```
## # A tibble: 9 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    5.10     1.85          2.76  5.84e- 3
## 2 Surface        0.000344 0.0000377     9.14  1.17e-19
## 3 matmetal       0.643    0.333         1.93  5.38e- 2
## 4 matwood        0.454    0.324         1.40  1.61e- 1
## 5 matpaper      -0.460    0.457        -1.01  3.14e- 1
## 6 matcanvas     -0.273    0.322        -0.845 3.98e- 1
## 7 Shapeoval     -0.519    1.84         -0.282 7.78e- 1
## 8 Shaperound    -1.62     1.84         -0.884 3.77e- 1
## 9 Shapesqu_rect -0.223    1.82         -0.123 9.02e- 1
```

---

## Application exercise:

### `ae-10-model-selection-pp`

- First, plan the full model
- Then, fit, evaluate, select
- Lastly, interpret the final model