class: center, middle, inverse, title-slide

# The language of models
📈

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01</a>
</span>
</div>

---

## Announcements

- HW 6 due Tuesday
- New teams:
  - team 1: tyee, akshara, florence, yuncong
  - team 2: andres, liz, aaron, margaret
  - team 3: yi, ben, natalie, ritik
  - team 4: evdards, jackson, dora, justin
- Dr. Cetinkaya-Rundel available on Slack for questions + Eidan has OH Friday and Monday

---

class: center, middle

# The language of models

---

## Modeling

- Use models to explain the relationship between variables and to make predictions
- For now we will focus on **linear** models (but remember there are *many* *many* other types of models too!)

---

class: center, middle

# Data: Paris Paintings

---

## Paris Paintings

```r
pp <- read_csv(
  "data/paris_paintings.csv",
  na = c("n/a", "", "NA")
)
```

.question[
What does the `data/` mean in the code above? Hint: Where is the data file located?
]

---

## Meet the data curators

.center[
Sandra van Ginhoven

Hilary Coe Cronheim

PhD students in the Duke Art, Law, and Markets Initiative in 2013
]

- Source: Printed catalogues of 28 auction sales in Paris, 1764-1780
- 3,393 paintings, their prices, and descriptive details from sales catalogues, over 60 variables

---

## Auctions today

.tiny[
https://www.youtube.com/watch?v=apaE1Q7r4so
]

---

## Auctions back in the day

Pierre-Antoine de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century)

---

## Paris auction market

---

## Depart pour la chasse

---

## Auction catalog text

.pull-left[
]
.pull-right[
.small[
Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background.
]
]

---

---

```r
pp %>%
  filter(name == "R1777-89a") %>%
  select(name:endbuyer) %>%
  glimpse()
```

```
## Observations: 1
## Variables: 21
## $ name              <chr> "R1777-89a"
## $ sale              <chr> "R1777"
## $ lot               <chr> "89"
## $ position          <dbl> 0.3755274
## $ dealer            <chr> "R"
## $ year              <int> 1777
## $ origin_author     <chr> "D/FL"
## $ origin_cat        <chr> "D/FL"
## $ school_pntg       <chr> "D/FL"
## $ diff_origin       <int> 0
## $ logprice          <dbl> 8.575462
## $ price             <dbl> 5300
## $ count             <int> 1
## $ subject           <chr> "D\u008epart pour la chasse"
## $ authorstandard    <chr> "Wouwerman, Philips"
## $ artistliving      <int> 0
## $ authorstyle       <chr> NA
## $ author            <chr> "Philippe Wouwermans"
## $ winningbidder     <chr> "Langlier, Jacques for Poullain, Antoine"
## $ winningbiddertype <chr> "DC"
## $ endbuyer          <chr> "C"
```

---

class: center, middle

# Modeling the relationship between variables

---

## Heights

.question[
Describe the distribution of heights of paintings.
]

```r
ggplot(data = pp, aes(x = Height_in)) +
  geom_histogram(bins = 30)
```

```
## Warning: Removed 252 rows containing non-finite values (stat_bin).
```

<!-- -->

---

## Widths

.question[
Describe the distribution of widths of paintings.
]

```r
ggplot(data = pp, aes(x = Width_in)) +
  geom_histogram(bins = 30)
```

```
## Warning: Removed 256 rows containing non-finite values (stat_bin).
```

<!-- -->

---

## Models as functions

- We can represent relationships between variables using **functions**
- A function is a mathematical concept: the relationship between an output and one or more inputs.
  - Plug in the inputs and receive back the output
  - Example: the formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`; when `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)`

```
y = 3 * 5 + 7 = 22
```

---

## Height as a function of width

.question[
Describe the relationship between height and width of paintings.
]

<!-- -->

---

## Visualizing the linear model

```r
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm") # lm for linear model
```

<!-- -->

---

## Visualizing the linear model

... without the measure of uncertainty around the line

```r
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

<!-- -->

---

## Visualizing the linear model

... with different cosmetic choices for the line

```r
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE,
              col = "pink", # color
              lty = 2,      # line type
              lwd = 3)      # line weight
```

<!-- -->

---

## Vocabulary

- **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)
- **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)
- **Predicted value:** Output of the **model function**
  - The model function gives the typical (expected) value of the response variable *conditioned* on the explanatory variables
- **Residuals:** A measure of how far each case is from its predicted value (based on a particular model)
  - Residual = Observed value - Predicted value
  - Tells how far above/below the expected value each case is

---

## Residuals

.question[
What does a negative residual mean? Which paintings on the plot have negative residuals, those below or above the line?
]

<!-- -->

---

.question[
The plot below displays the relationship between height and width of paintings. The only difference from the previous plot is that it uses a smaller alpha value, making the points somewhat transparent. What feature is apparent in this plot that was not (as) apparent in the previous plots? What might be the reason for this feature?
]

<!-- -->

---

## Landscape paintings

- Landscape painting is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.<sup>1</sup>
  - Landscape paintings tend to be wider than they are tall.
- Portrait painting is a genre in painting where the intent is to depict a human subject.<sup>2</sup>
  - Portrait paintings tend to be taller than they are wide.

.footnote[
[1] Source: Wikipedia, [Landscape painting](https://en.wikipedia.org/wiki/Landscape_painting)

[2] Source: Wikipedia, [Portrait painting](https://en.wikipedia.org/wiki/Portrait_painting)
]

---

## Multiple explanatory variables

.question[
How, if at all, does the relationship between width and height of paintings vary by whether or not they have any landscape elements?
]

.small[
```r
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "landscape")
```

<!-- -->
]

---

## Extending regression lines

.small[
```r
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  labs(color = "landscape")
```

<!-- -->
]

---

## Models - upsides and downsides

- Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.
- There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.

---

## Variation around the model...

is just as important as the model, if not more!

*Statistics is the explanation of variation in the context of what remains unexplained.*

- The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
- Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.)

---

## How do we use models?

- Explanation: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables
- Prediction: Plug in `\(x\)`, get the predicted `\(y\)`

---

class: center, middle

# Interpreting models

---

## Height & width

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)
tidy(m_ht_wt)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
```

--

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

--

- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
  - Does this make sense?

---

## broom `\(\in\)` tidyverse

.pull-left[
]
.pull-right[
- **broom** follows tidyverse principles and tidies up regression output
- `tidy`: Constructs a tidy data frame summarizing a model's statistical findings
- `glance`: Constructs a concise one-row summary of the model
- `augment`: Adds columns (e.g. predictions, residuals) to the original data that was modeled
]

.footnote[
[broom.tidyverse.org](https://broom.tidyverse.org)
]

---

## Linear model with a single predictor

- We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Tough luck, you can't have them...

--

- So we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Least squares regression

- The regression line minimizes the sum of squared residuals.

--

- If `\(e_i = y_i - \hat{y}_i\)`, then the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

---

## Visualizing residuals

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->
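---

## Residuals in R

`augment()` from **broom** attaches the predicted value (`.fitted`) and residual (`.resid`) for each painting used in the fit. A minimal sketch with the `m_ht_wt` model from earlier (the object name `ht_wt_aug` is just illustrative):

```r
library(broom)

# predicted heights and residuals for the height vs. width model
ht_wt_aug <- augment(m_ht_wt)

ht_wt_aug %>%
  select(Height_in, Width_in, .fitted, .resid) %>%
  slice(1:5)
```

Paintings above the line have `.resid` > 0 and paintings below the line have `.resid` < 0, matching the earlier question about negative residuals.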
---

## Properties of the least squares regression line

- The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`:

`$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$`

- The slope has the same sign as the correlation coefficient:

`$$b_1 = r \frac{s_y}{s_x}$$`

- The sum of the residuals is zero:

`$$\sum_{i = 1}^n e_i = 0$$`

- The residuals and `\(x\)` values are uncorrelated.

---

## Height & landscape features

```r
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
tidy(m_ht_lands)
```

```
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          22.7      0.328      69.1 0.
## 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
```

--

`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`

---

## Height & landscape features (cont.)

- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
  - Compares baseline level (`landsALL = 0`) to other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.

---

## Categorical predictor with 2 levels

```
## # A tibble: 8 x 3
##   name     price landsALL
##   <chr>    <dbl>    <int>
## 1 L1764-2    360        0
## 2 L1764-3      6        0
## 3 L1764-4     12        1
## 4 L1764-5a     6        1
## 5 L1764-5b     6        1
## 6 L1764-6      9        0
## 7 L1764-7a    12        0
## 8 L1764-7b    12        0
```

---

## Relationship between height and school

```r
m_ht_sch <- lm(Height_in ~ school_pntg, data = pp)
tidy(m_ht_sch)
```

```
## # A tibble: 7 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)        14.       10.0      1.40  0.162
## 2 school_pntgD/FL     2.33     10.0      0.232 0.816
## 3 school_pntgF       10.2      10.0      1.02  0.309
## 4 school_pntgG        1.65     11.9      0.139 0.889
## 5 school_pntgI       10.3      10.0      1.02  0.306
## 6 school_pntgS       30.4      11.4      2.68  0.00744
## 7 school_pntgX        2.87     10.3      0.279 0.780
```

--

- When the categorical explanatory variable has many levels, its levels are encoded as **dummy variables**.
- Each coefficient describes the expected difference in height between that particular school and the baseline level.

---

## Categorical predictor with >2 levels

```
## # A tibble: 7 x 7
## # Groups: school_pntg [7]
##   school_pntg  D_FL     F     G     I     S     X
##   <chr>       <int> <int> <int> <int> <int> <int>
## 1 A               0     0     0     0     0     0
## 2 D/FL            1     0     0     0     0     0
## 3 F               0     1     0     0     0     0
## 4 G               0     0     1     0     0     0
## 5 I               0     0     0     1     0     0
## 6 S               0     0     0     0     1     0
## 7 X               0     0     0     0     0     1
```

---

## The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## Correlation does not imply causation!

Remember this when interpreting model coefficients

.footnote[
Source: XKCD, [Cell phones](https://xkcd.com/925/)
]
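---

## Using the model for prediction

"Plug in `\(x\)`, get the predicted `\(y\)`" is one call to `predict()` once a model has been fit. A minimal sketch with the `m_ht_wt` model; the 25-inch width is just a made-up example value:

```r
# a hypothetical new painting that is 25 inches wide
new_painting <- tibble(Width_in = 25)

predict(m_ht_wt, newdata = new_painting)
```

This matches plugging into the estimated model by hand: `\(3.62 + 0.78 \times 25 \approx 23.1\)` inches.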