Multiple linear regression

# Multiple linear regression
## Intro to Data Science
### Shawn Santo
### 02-25-20

---

## Announcements

- Homework 3 assigned Thursday

- Lab 5 due Friday, Feb 28 at 11:59pm

---

# Linear model with multiple predictors

---

## Data and packages

```r
library(tidyverse)
library(broom)
```

```r
paris_paintings <- read_csv("data/paris_paintings.csv", 
 na = c("n/a", "", "NA"))
```

- [Paris Paintings Codebook](http://www2.stat.duke.edu/courses/Spring20/sta199.001/data/code_books/paris_codebook.html)

- Source: Printed catalogues of 28 auction sales in Paris, 1764- 1780

- 3,393 paintings, their prices, and descriptive details from sales 
  catalogues over 60 variables

---

## Three variables

What is the typical surface area for paintings?

Less than 1000 square inches (which is roughly a painting that is 31in x 31in). 
There are very few paintings that have surface area above 5000 square inches.

---

For simplicity, let's focus on the paintings with `Surface < 5000`.

---

## Price, surface, and living artist

Does the relationship between surface and logged price vary by whether or not
the artist is living?

```r
ggplot(data = pp_surf_5000,
       mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  labs(color = "Living artist", y = "Price (logarithm)") +
  theme_bw(base_size = 16)
```

<img src="lec-08a-mlr_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

## Modeling with main effects

```r
m_main <- lm(log(price) ~ Surface + factor(artistliving), 
 data = pp_surf_5000)

m_main %>%
  tidy() %>%
  select(term, estimate)
```

```
#> # A tibble: 3 x 2
#> term estimate
#> <chr> <dbl>
#> 1 (Intercept) 4.88 
#> 2 Surface 0.000265
#> 3 factor(artistliving)1 0.137
```

Linear model:

$$ \widehat{\log(price)} = 4.88 + 0.00027~surface + 0.14~artistliving $$

- Plug in 0 for `artistliving` to get the linear model for paintings by
  non-living artists.
  
- Plug in 1 for `artistliving` to get the linear model for paintings by living
  artists.

---

## Interpretation of main effects

---

- Non-living artist: 
`$$\begin{align}\widehat{\log(price)} &= 4.88 + 0.00027~surface + 0.14 \times 0\\
&= 4.88 + 0.00027~surface\\\end{align}$$`

- Living artist: 
`$$\begin{align}\widehat{\log(price)} &= 4.88 + 0.00027~surface + 0.14 \times 1\\
&= 5.02 + 0.00027~surface\\\end{align}$$`

- Rate of change in log price as the surface area of the painting increases 
  does not vary between paintings by living and non-living artists 
  (same slope)

- Paintings by living artists are consistently more expensive than paintings 
  by non-living artists (different intercept)

---

## Main effects, numerical and categorical predictors

```
#> # A tibble: 3 x 3
#> term estimate exp_estimate
#> <chr> <dbl> <dbl>
#> 1 (Intercept) 4.88 132. 
#> 2 Surface 0.000265 1.00
#> 3 factor(artistliving)1 0.137 1.15
```

- **All else held constant**, for each additional square inch in a painting's 
  surface area, the log price of the painting is predicted, on average, to 
  increase by 0.0003 (the price is expected to be higher by a multiplicative 
  factor of 1.0003.

- **All else held constant**, paintings by a living artist are predicted, on 
  average, to have a log price that is 0.137 greater compared to paintings by
  artists who are no longer alive (the price is expected to be higher by a 
  multiplicative factor of 
  1.1471)

- Paintings that are by an artist who is not alive and that have a surface area 
  of 0 square inches are predicted, on average, to be 
  131.6417 
  livres.

---

## What went wrong?

Why is our linear regression model different from what we got 
from `geom_smooth(method = "lm", se = FALSE)`?

---

## What went wrong?

Why is our linear regression model different from what we got 
from `geom_smooth(method = "lm", se = FALSE)`?

---

## What went wrong? (cont.)

- The way we specified our model only lets `artistliving` affect the intercept.

- Model implicitly assumes that paintings with living and deceased artists have
  the *same slope* and only allows for *different intercepts*.

- What seems more appropriate in this case? 
    
    * Same slope and same intercept for both colors?
    
    * Same slope and different intercept for both colors?
    
    * Different slope and different intercept for both colors?

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. 
  nonparallel lines.

- This implies that the regression coefficient for an explanatory variable would 
  change as another explanatory variable changes.

- This can be accomplished by adding an 
  **interaction variable**: the product of two explanatory variables.

---

## Price vs. surface and artist living interacting

```r
ggplot(data = pp_surf_5000,
       mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Surface", y = "Log(price)", color = "Living artist") +
  theme_bw(base_size = 16)
```

<img src="lec-08a-mlr_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

## Modeling with interaction effects

```r
*m_int <- lm(log(price) ~ Surface * factor(artistliving),
 data = pp_surf_5000)

tidy(m_int) %>% 
  select(estimate)
```

```
#> # A tibble: 4 x 1
#> estimate
#> <dbl>
#> 1 4.91 
#> 2 0.000206
#> 3 -0.126 
#> 4 0.000479
```

`$$\widehat{log(price)} = 4.91  + 0.00021~surface - 0.126~artistliving  + 0.00048~surface \times artistliving$$`

---

## Interpretation of interaction effects

**Non-living artist:**

`$$\begin{align}\widehat{\log(price)} &= 4.91 + 0.00021~surface - 0.126 \times 0  + 0.00048~surface \times 0\\\\
&= 4.91 + 0.00021~surface\end{align}$$`

**Living artist:**

`$$\begin{align}\widehat{\log(price)} &= 4.91 + 0.00021~surface - 0.126 \times 1 + 0.00048~surface \times 1\\\\
&= 4.784 + 0.00069~surface\end{align}$$`

---

## Interpretation of interaction effects

**Non-living artist:**

`$$\widehat{log(price)} = 4.91 + 0.00021~surface$$`

**Living artist:**

`$$\widehat{log(price)} =  4.784 + 0.00069~surface$$`

- Rate of change in price as the surface area of the painting increases does 
 vary between paintings by living and non-living artists 
 (different slopes).

- Some paintings by living artists are more expensive than paintings by
 non-living artists, and some are not 
 (different intercept).

---

## Interpretation of interaction effects

- Non-living artist: 
`$$\widehat{log(price)} = 4.91 + 0.00021~surface$$`

- Living artist: 
`$$\widehat{log(price)} =  4.784 + 0.00069~surface$$`

---

## Continuous by continuous interactions

- Interpretation becomes trickier.

- Slopes conditional on values of explanatory variables.

- Come see me if you encounter these.

---

## Third order interactions

- Can you? Yes

- Should you? Probably not if you want to interpret these interactions in 
  context of the data.

---

# Your turn

---

## Fitting and interpreting linear models

Clone the repo at https://classroom.github.com/a/0sPhWFKl
and create an RStudio Cloud project. Don't forget to commit and push your 
changes.

Our goal is to predict the log price of paintings in this auction.

**Task 1**: Fit a multiple linear regression model with the following main 
effects: `lrgfont`, `Height_in`, `Width_in`, and `relig`.

Interpret each of the estimated model coefficients.

**Task 2**: Fit a multiple linear regression model with main effects for 
`Height_in` and `lrgfont`, as well as their interaction.

What is the model equation for paintings where the dealer devotes an additional 
paragraph? How about for paintings where the dealer does NOT devote an 
additional paragraph?

---

# Code: cleaning up categorical data

---

## Shape and material

Collapse levels of `Shape` and `mat`erial variables with 
**`forcats::fct_collapse`**:

```r
paris_paintings <- paris_paintings %>%
 mutate(
 Shape = fct_collapse(Shape, oval = c("oval", "ovale"),
 round = c("round", "ronde"),
 squ_rect = "squ_rect",
 other = c("octogon", "octagon", "miniature")),
 mat = fct_collapse(mat, metal = c("a", "br", "c"),
 canvas = c("co", "t", "ta"),
 paper = c("p", "ca"),
 wood = "b",
 other = c("e", "g", "h", "mi", "o", "pa", "v", "al", "ar", "m"))
 )
```
]

---

## Review fixes

```r
paris_paintings %>%
  count(Shape)
```

```
#> # A tibble: 5 x 2
#> Shape n
#> <fct> <int>
#> 1 other 12
#> 2 oval 52
#> 3 round 74
#> 4 squ_rect 3219
#> 5 <NA> 36
```
]

```r
paris_paintings %>%
  count(mat)
```

```
#> # A tibble: 6 x 2
#> mat n
#> <fct> <int>
#> 1 metal 321
#> 2 other 59
#> 3 wood 886
#> 4 paper 38
#> 5 canvas 1783
#> 6 <NA> 306
```
]