class: center, middle, inverse, title-slide

# Model fit and model selection
## Intro to Data Science
### Shawn Santo
### 02-27-20

---

## Announcements

- Homework 3 assigned today (check website)

- Lab 5 due Friday, Feb 28 at 11:59pm

- I'll post a reading to be completed for next lecture.

---

class: center, middle, inverse

# Recall

---

## Multiple linear regression

- We want to investigate how a collection of predictors is associated with a response.

- Our objective is to find the best-fit regression surface, as opposed to the best-fit line when we have only one predictor.

- To assess linearity we can no longer use an `x` vs. `y` scatter plot. We'll need to use residual plots.

- Interpretation changes slightly. When interpreting a single coefficient, we hold all other predictors constant.

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.

- This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.

- This can be accomplished by adding an **interaction variable**: the product of two explanatory variables.

---

## Load data; create models

Reminder: we're trying to create a model to predict log price based on surface area and whether the artist is still alive.

```r
library(tidyverse)
library(broom)

paris_paintings <- read_csv("data/paris_paintings.csv",
                            na = c("n/a", "", "NA"))

pp_surf_5000 <- paris_paintings %>%
  filter(Surface < 5000)

m_main <- lm(log(price) ~ Surface + factor(artistliving),
             data = pp_surf_5000)

m_int <- lm(log(price) ~ Surface * factor(artistliving),
            data = pp_surf_5000)
```

---

class: center, middle, inverse

# Quality of fit in MLR

---

## `\(R^2\)`

- `\(R^2\)` is the percentage of variability in the response variable explained by the regression model.

```r
glance(m_main) %>% 
  pull(r.squared)
```

```
#> [1] 0.01320884
```

```r
glance(m_int) %>% 
  pull(r.squared)
```

```
#> [1] 0.0176922
```

--

- Clearly the model with interactions has a higher `\(R^2\)`.

--

- However, using `\(R^2\)` for model selection in models with multiple explanatory variables is not a good idea, as `\(R^2\)` increases when **any** variable is added to the model.

---

## `\(R^2\)` - first principles

- We can express `\(R^2\)` using the following ratio of sums of squares:

$$R^2 = 1 - \left( \frac{SS\_{Error}}{SS\_{Total}} \right)$$

where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total variability in the response variable.

- But remember, adding ANY explanatory variable will always increase `\(R^2\)`.

---

## Adjusted `\(R^2\)`

$$R^2\_{adj} = 1 - \left( \frac{SS\_{Error}}{SS\_{Total}} \times \frac{n - 1}{n - k - 1} \right),$$

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.

--

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated; the `\((n - 1)/(n - k - 1)\)` factor penalizes each additional predictor. (Both formulas are checked by hand a few slides ahead.)

--

- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Occam's Razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

- Model selection follows this principle.

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

- In other words, we prefer the simplest best model, i.e. the <font class="vocab">parsimonious</font> model.
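---

## Checking the formulas by hand

Before comparing the two models, here is a minimal sketch that recomputes `\(R^2\)` and adjusted `\(R^2\)` for `m_main` directly from the sums of squares above. It assumes the `m_main` fit from earlier; the results should match what `glance()` reports.

```r
# response values actually used in the fit, i.e. log(price)
y <- model.response(model.frame(m_main))

ss_error <- sum(residuals(m_main)^2)  # SS_Error: sum of squared residuals
ss_total <- sum((y - mean(y))^2)      # SS_Total: total variability in response

n <- length(y)  # number of observations
k <- 2          # predictors: Surface and factor(artistliving)

1 - ss_error / ss_total                            # R^2
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)  # adjusted R^2
```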
---

## Comparing models

Adding the interaction increased adjusted `\(R^2\)`, so we should use the model with the interaction.

```r
glance(m_main) %>% 
  pull(adj.r.squared)
```

```
#> [1] 0.01258977
```

```r
glance(m_int) %>% 
  pull(adj.r.squared)
```

```
#> [1] 0.01676753
```

---

class: center, middle, inverse

# Model selection

---

## Backwards elimination

- Start with a **full** model (including all candidate explanatory variables and all candidate interactions).

- Remove one variable at a time, and select the model with the highest adjusted `\(R^2\)`.

- Continue until adjusted `\(R^2\)` does not increase.

---

## Forward selection

- Start with an **empty** model.

- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted `\(R^2\)`.

- Continue until adjusted `\(R^2\)` does not increase.

---

## Model selection and interaction effects

If an interaction is included in the model, the main effects of both of those variables must also be in the model.

If a main effect is not in the model, then its interaction should not be in the model.

---

## Other model selection criteria

- Adjusted `\(R^2\)` is only one model selection criterion.

- There are many others out there; AIC and BIC are commonly used ones.

---

## Your turn

We will be revisiting the candy rankings data from HW 01.

**Project:** Model selection for candy data

**Goal:** Come up with the model that "best" predicts win percentage of candies

---

## Planning

Decide on a subset of variables to consider for your analysis.

- Consider 7-10 total variables, including interactions.

- Consider at least two interactions in the model.

- Interactions should be between a categorical variable and a numeric variable.

- The more variables you consider, the longer model selection will take, so keep that in mind.

---

## Task

- Use backwards elimination to do model selection. <b>Make sure to show each step of the decision process.</b> You don't have to interpret the models at each stage.

- Provide interpretations for the slopes of your final model.

- Further instructions are given in the .Rmd file for the application exercise, available here: https://classroom.github.com/a/p7Ej9Qm8
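---

## Getting started (sketch)

As a starting point for the exercise, the sketch below shows one backwards-elimination step: fit a full model, drop one candidate term at a time, and compare adjusted `\(R^2\)`. It assumes the candy data are available as `candy_rankings` from the **fivethirtyeight** package, and the variables shown are only illustrative; swap in however you loaded the data for HW 01 and your own choice of variables and interactions.

```r
library(fivethirtyeight)  # assumed source of the candy_rankings data
library(broom)

# full model with an interaction between a categorical and a numeric variable
m_full <- lm(winpercent ~ chocolate * sugarpercent + fruity + pricepercent +
               hard + bar,
             data = candy_rankings)

# drop one candidate term at a time and refit
m_drop_bar  <- update(m_full, . ~ . - bar)
m_drop_hard <- update(m_full, . ~ . - hard)

# keep the model with the highest adjusted R^2; repeat until it stops increasing
glance(m_full)$adj.r.squared
glance(m_drop_bar)$adj.r.squared
glance(m_drop_hard)$adj.r.squared
```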