Multiple linear regression 🤹‍♀

class: center, middle, inverse, title-slide

# Multiple linear regression <br> 🤹‍♀

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---

## Announcements

- Peer evaluations due tonight at 11:59pm - 6/10 of you still need to fill them out
- Project assignment is posted, due next Tuesday

---

class: center, middle

# The linear model with multiple predictors

---

## Getting started

**Data:** Paris Paintings

```r
pp <- read_csv("data/paris_paintings.csv", 
               na = c("n/a", "", "NA"))
```

---

## Multiple predictors

- Response variable: log(price)
- Explanatory variables: Width and height

```r
m_wi_hgt <- lm(log(price) ~ Width_in + Height_in, data = pp)
tidy(m_wi_hgt)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0.      
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

--
- Linear model:
`$$\widehat{log(price)} = 4.77 + 0.0269 width - 0.0133 height$$`

---

## Visualizing models with multiple predictors

<div id="htmlwidget-677144f1229fd569137e" style="width:100%;height:90%;" class="widgetframe html-widget"></div>
<script type="application/json" data-for="htmlwidget-677144f1229fd569137e">{"x":{"url":"u2_d04-multiple-linear-regression_files/figure-html//widgets/widget_plotly.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]}</script>

---

## Exploration 1

### Price, surface area, and living artist

- Explore the relationship between price of paintings and surface area, conditioned 
on whether or not the artist is still living
- First visualize and explore, then model

---

## Typical surface area

.question[
What is the typical surface area for paintings?
]

![](u2_d04-multiple-linear-regression_files/figure-html/viz-surf-artistliving-1.png)

Less than 1000 square inches (which is roughly a painting that is 31in x 31in). There are very few paintings that have surface area above 5000 square inches.

---

## Narrowing the scope

For simplicity let's focus on the paintings with `Surface < 5000`:

```r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)
```

![](u2_d04-multiple-linear-regression_files/figure-html/viz-surf-lt-5000-artistliving-1.png)

---

## Two ways to model

- **Main effects:** Assuming relationship between surface and logged price 
**does not vary** by whether or not the artist is living.
- **Interaction effects:** Assuming relationship between surface and logged 
price **varies** by whether or not the artist is living.

.pull-left[
![](u2_d04-multiple-linear-regression_files/figure-html/viz-main-effects-1.png)
]
.pull-right[
![](u2_d04-multiple-linear-regression_files/figure-html/viz-interaction-effects-1.png)
]

---

## Fit model with main effects

- Response variable: log(price)
- Explanatory variables: Surface area and artist living (0/1 variable)

.midi[

```r
m_main <- lm(log(price) ~ Surface + factor(artistliving), 
             data = pp_Surf_lt_5000)
tidy(m_main)
```

```
## # A tibble: 3 x 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           4.88     0.0424       115.   0.      
## 2 Surface               0.000265 0.0000415      6.39 1.85e-10
## 3 factor(artistliving)1 0.137    0.0970         1.41 1.57e- 1
```
]

--
- Linear model:
$$ \widehat{log(price)} = 4.88 + 0.000265~surface + 0.137~artistliving $$

---

## Solving the model

![](u2_d04-multiple-linear-regression_files/figure-html/viz-main-effects-2-1.png)

--
- Non-living artist: Plug in 0 for `artistliving`
`$\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 0$`  
`$= 4.88 + 0.000265~surface$`

--
- Living artist: Plug in 1 for `artistliving`
`$\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 1$`  
`$= 5.017 + 0.000265~surface$`

---

## Visualizing main effects

![](u2_d04-multiple-linear-regression_files/figure-html/unnamed-chunk-1-1.png)

- **Same slope:** Rate of change in price as the surface area increases does 
not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are consistently more 
expensive than paintings by non-living artists.

---

## Interpreting main effects

.midi[

```r
tidy(m_main) %>% 
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 x 3
##   term                  estimate exp_estimate
##   <chr>                    <dbl>        <dbl>
## 1 (Intercept)           4.88           132.  
## 2 Surface               0.000265         1.00
## 3 factor(artistliving)1 0.137            1.15
```
]

- All else held constant, for each additional square inch in painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of 1.
- All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.

---

## What went wrong?

.question[
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
]

.pull-left[
![](u2_d04-multiple-linear-regression_files/figure-html/unnamed-chunk-2-1.png)
]
.pull-right[
![](u2_d04-multiple-linear-regression_files/figure-html/viz-main-effects3-1.png)
]

---

## What went wrong? (cont.)

- The way we specified our model only lets `artistliving` affect the intercept.
- Model implicitly assumes that paintings with living and deceased artists have the *same slope* and only allows for *different intercepts*.  
- What seems more appropriate in this case? 
    + Same slope and same intercept for both colors
    + Same slope and different intercept for both colors
    + Different slope and different intercept for both colors?

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. 
nonparallel lines.
- This implies that the regression coefficient for an explanatory variable would 
change as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two 
explanatory variables.

---

## Interaction: surface * artist living

.small[

```r
ggplot(data = pp_Surf_lt_5000,
       mapping = aes(y = log(price), x = Surface, 
                     color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Surface", y = "Log(price)", color = "Living artist")
```

![](u2_d04-multiple-linear-regression_files/figure-html/viz-interaction-effects1-1.png)
]

---

## Fit model with interaction effects

- Response variable: log(price)
- Explanatory variables: Surface area, artist living (0/1 variable), and 
their interaction

.midi[

```r
m_int <- lm(log(price) ~ Surface + factor(artistliving) + 
              Surface * factor(artistliving), 
            data = pp_Surf_lt_5000)
tidy(m_int)
```

```
## # A tibble: 4 x 5
##   term                           estimate std.error statistic    p.value
##   <chr>                             <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)                    4.91     0.0432       114.   0         
## 2 Surface                        0.000206 0.0000442      4.65 0.00000337
## 3 factor(artistliving)1         -0.126    0.119         -1.06 0.289     
## 4 Surface:factor(artistliving)1  0.000479 0.000126       3.81 0.000139
```
]

- Linear model:
$$ \widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving $$
$$+ ~ 0.00048~surface \times artistliving $$

---

## Interpretation of interaction effects

- Rate of change in price as the surface area of the painting increases does 
vary between paintings by living and non-living artists (different slopes), 
- Some paintings by living artists are more expensive than paintings by
non-living artists, and some are not (different intercept).

.small[
.pull-left[
- Non-living artist: 
`$\widehat{log(price)} = 4.91 + 0.00021~surface$`
`$- 0.126 \times 0 + 0.00048~surface \times 0$`
`$= 4.91 + 0.00021~surface$`
- Living artist: 
`$\widehat{log(price)} = 4.91 + 0.00021~surface$`
`$- 0.126 \times 1 + 0.00048~surface \times 1$`
`$= 4.91 + 0.00021~surface$`
`$- 0.126 + 0.00048~surface$`
`$= 4.784 + 0.00069~surface$`
]
.pull-right[
![](u2_d04-multiple-linear-regression_files/figure-html/viz-interaction-effects2-1.png)
]
]

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context
of the data.