class: center, middle, inverse, title-slide

# Language of Models

### Dr. Maria Tackett

### 02.18.19

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

- Complete the reading (if you haven't already done so)

- Team Feedback 1 due Friday at 11:59p (you'll get an email with the survey after class)

---

class: center, middle

## Characterizing relationships with models

---

## Data & packages

```r
library(tidyverse)
library(broom)
```

```r
pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA"))
```

---

## Want to follow along?

Go to RStudio Cloud -> make a copy of "Modeling Paris Paintings"

---

## Height & width

```r
(m_ht_wt <- lm(Height_in ~ Width_in, data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808
```

--

<br>

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

--

- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
    - Does this make sense?

---

### The linear model with a single predictor

- We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Tough luck, you can't have them...

--

- So we use the sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Least squares regression

The regression line minimizes the sum of squared residuals.

--

If `\(e_i = y_i - \hat{y}_i\)`, then the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

---

## Visualizing residuals

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->

---

### Properties of the least squares regression line

- The slope has the same sign as the correlation coefficient:

`$$b_1 = r \frac{s_y}{s_x}$$`

- The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`: `\((\bar{x}, \bar{y})\)`. Plugging this point into the line gives the intercept:

`$$\bar{y} = b_0 + b_1 \bar{x} \hspace{5mm} ~ \Rightarrow \hspace{5mm}~ b_0 = \bar{y} - b_1 \bar{x}$$`

---

### Properties of the least squares regression line

- The sum of the residuals is zero:

`$$\sum_{i = 1}^n e_i = 0$$`

- The residuals and `\(x\)` values are uncorrelated.

---

### Height & landscape features

```r
(m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
## 
## Coefficients:
##       (Intercept)  factor(landsALL)1  
##            22.680             -5.645
```

--

<br>

`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`

---

### Height & landscape features (cont.)

- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
    - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).

- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.
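---

### Height & landscape features: a quick check

A minimal sketch (assuming `pp` is loaded as above): with a single two-level predictor, the intercept and the intercept plus the slope should match the group means of `Height_in`.

```r
# mean height for paintings without (landsALL = 0) and with (landsALL = 1)
# landscape features; expect roughly 22.68 and 22.68 - 5.65 = 17.03 inches
pp %>%
  group_by(landsALL) %>%
  summarise(mean_height = mean(Height_in, na.rm = TRUE))
```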
---

### Categorical predictor with 2 levels

```
## # A tibble: 8 x 3
##   name     price landsALL
##   <chr>    <dbl>    <int>
## 1 L1764-2    360        0
## 2 L1764-3      6        0
## 3 L1764-4     12        1
## 4 L1764-5a     6        1
## 5 L1764-5b     6        1
## 6 L1764-6      9        0
## 7 L1764-7a    12        0
## 8 L1764-7b    12        0
```

---

### Relationship between height and school

```r
(m_ht_sch <- lm(Height_in ~ school_pntg, data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
## 
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG  
##          14.000            2.329           10.197            1.650  
##    school_pntgI     school_pntgS     school_pntgX  
##          10.287           30.429            2.869
```

--

- When the categorical explanatory variable has many levels, the levels are encoded as <font class="vocab">dummy variables</font>.

- Each coefficient describes the expected difference in height between that particular school and the baseline level.

---

### Categorical predictor with >2 levels

.small[
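A sketch of how to see this encoding directly (assuming `pp` is loaded as above): `model.matrix()` shows the 0/1 dummy variables R builds from `school_pntg`, with the alphabetically first level as the baseline.

```r
# count the paintings in each school, then peek at the dummy variables
# R creates for the regression (one column per non-baseline level)
pp %>%
  count(school_pntg)

model.matrix(~ school_pntg, data = pp) %>%
  head()
```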
]

---

### The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

<br>

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

### Correlation does not imply causation!

Remember this when interpreting model coefficients.

---

class: center, middle

# Prediction with models

---

## Predict height from width

.question[
On average, how tall are paintings that are 60 inches wide?
`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

--

```r
3.62 + 0.78 * 60
```

```
## [1] 50.42
```

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)

---

## Prediction vs. extrapolation

.question[
On average, how tall are paintings that are 400 inches wide?
`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

<!-- -->

---

## Watch out for extrapolation!

> "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup>
<br>
Stephen Colbert, April 6th, 2010

.footnote[
[1] "Extrapolation is treacherous." OpenIntro Statistics.
]

---

class: center, middle

# Measuring model fit

---

## Measuring the strength of the fit

- The strength of the fit of a linear model is most commonly evaluated using `\(R^2\)`.

- It tells us what percent of the variability in the response variable is explained by the model.

- The remainder of the variability is due to variables not included in the model or to inherent randomness in the data.

- `\(R^2\)` is sometimes called the coefficient of determination.

---

## Obtaining `\(R^2\)` in R

- Height vs. width

.small[

```r
glance(m_ht_wt)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```

```r
glance(m_ht_wt)$r.squared # extract R-squared
```

```
## [1] 0.6829468
```
]

Roughly 68% of the variability in heights of paintings can be explained by their widths.

---

## Obtaining `\(R^2\)` in R

- Height vs. landscape features

.small[

```r
glance(m_ht_lands)$r.squared
```

```
## [1] 0.03456724
```
]

---

class: center, middle

# Tidy regression output

---

class: middle

Let's revisit the model predicting heights of paintings from their widths:

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)
```

---

## Not-so-tidy regression output

- You might come across these in your googling adventures, but we'll try to stay away from them

- Not because they are wrong, but because they don't return the results as tidy data frames
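For example (a sketch; output not shown), base R accessors return vectors and matrices rather than data frames:

```r
coef(m_ht_wt)                  # named numeric vector of estimates
summary(m_ht_wt)$coefficients  # matrix of estimates, std. errors, t and p values
```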
---

## Not-so-tidy regression output (1)

Option 1:

```r
m_ht_wt
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808
```

---

## Not-so-tidy regression output (2)

Option 2:

```r
summary(m_ht_wt)
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.714  -4.384  -2.422   3.169  85.084 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.621406   0.253860   14.27   <2e-16 ***
## Width_in    0.780796   0.009505   82.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.304 on 3133 degrees of freedom
##   (258 observations deleted due to missingness)
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6828
## F-statistic:  6749 on 1 and 3133 DF,  p-value: < 2.2e-16
```

---

## Review

.question[
What makes a data frame tidy?
]

--

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

---

## Tidy regression output

Achieved with functions from the broom package:

- **`tidy`**: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*.

- **`augment`**: Adds columns to the original data that was modeled. This includes predictions and residuals.

- **`glance`**: Constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard error that are computed once for the entire model*.

---

### Tidy your model's statistical findings

```r
tidy(m_ht_wt)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.62    0.254        14.3 8.82e-45
## 2 Width_in       0.781   0.00950      82.1 0.
```

--

```r
tidy(m_ht_wt) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
## # A tibble: 2 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    3.62 
## 2 Width_in       0.781
```

---

### Augment data with model results

New variables of note (for now):

- **`.fitted`**: Predicted value of the response variable
- **`.resid`**: Residuals

.small[

```r
augment(m_ht_wt) %>%
  slice(1:5)
```

```
## # A tibble: 5 x 10
##   .rownames Height_in Width_in .fitted .se.fit .resid    .hat .sigma
##   <chr>         <dbl>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
## 1 1                37     29.5    26.7   0.166  10.3  3.99e-4   8.30
## 2 2                18     14      14.6   0.165   3.45 3.96e-4   8.31
## 3 3                13     16      16.1   0.158  -3.11 3.61e-4   8.31
## 4 4                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## 5 5                14     18      17.7   0.152  -3.68 3.37e-4   8.31
## # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
```
]

--

.question[
Why might we be interested in these new variables?
]

---

## Residuals plot

.small[

```r
m_ht_wt_aug <- augment(m_ht_wt)

ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "blue", lty = 2) +
  labs(x = "Predicted height", y = "Residuals")
```

<!-- -->
]

--

.question[
What does this plot tell us about the fit of the linear model?
]

---

## Glance to assess model fit

```r
glance(m_ht_wt)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```

--

```r
glance(m_ht_wt)$r.squared
```

```
## [1] 0.6829468
```

--
The `\(R^2\)` is 68.29%.

--

```r
glance(m_ht_wt)$adj.r.squared
```

```
## [1] 0.6828456
```

---

class: center, middle

# Exploring linearity

---

## Data: Paris Paintings

<!-- -->

---

## Price vs. width

.question[
Describe the relationship between price and width of painting.
]

<!-- -->

---

### Let's focus on paintings with `Width_in < 100`

```r
pp_wt_lt_100 <- pp %>%
  filter(Width_in < 100)
```

---

## Price vs. width

.question[
Which plot shows a more linear relationship?
]

.small[
.pull-left[
<!-- -->
]
.pull-right[
<!-- -->
]
]

---

## Price vs. width, residuals

.question[
Which plot shows residuals that are uncorrelated with the predicted values from the model?
]

.small[
.pull-left[
<!-- -->
]
.pull-right[
<!-- -->
]
]

--

.question[
What's the unit of the residuals?
]

---

## Transforming the data

- We saw that `price` has a right-skewed distribution, and the relationship between price and width of painting is non-linear.

--

- In these situations a transformation applied to the response variable may be useful.

--

- In order to decide which transformation to use, we should examine the distribution of the response variable.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
    - log = natural log, `\(ln\)`
    - Default base of the `log` function in R is the natural log: <br>
      `log(x, base = exp(1))`

---

## Logged price vs. width

.question[
How do we interpret the slope of this model?
]

<!-- -->

---

### Interpreting models with log transformation

```r
m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100)

m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
## # A tibble: 2 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    4.67 
## 2 Width_in       0.019
```

---

### Linear model with log transformation

$$ \widehat{log(price)} = 4.67 + 0.02~Width $$

--

- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 log livres.

--

- which is not a very useful statement...

---

## Working with logs

- Subtraction and logs: `\(log(a) − log(b) = log(a / b)\)`

--

- Natural logarithm: `\(e^{log(x)} = x\)`

--

- We can use these identities to "undo" the log transformation

---

### Interpreting models with log transformation

The slope coefficient for the log-transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.

--

$$ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 $$

--

$$ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 $$

--

$$ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} $$

--

$$ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 $$

--

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.

---

## Shortcuts in R

```r
m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
## # A tibble: 2 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    4.67 
## 2 Width_in       0.019
```

--

```r
m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(exp(estimate), 3))
```

```
## # A tibble: 2 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   107.  
## 2 Width_in        1.02
```
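---

### Checking the interpretation with `predict()`

A minimal sketch (using an illustrative `new_widths` data frame and the `m_lprice_wt` model from the previous slides): predictions at widths one inch apart, back-transformed with `exp()`, should differ by a factor of roughly `\(e^{0.019} \approx 1.02\)`.

```r
# illustrative widths of 20 and 21 inches; predictions are on the log scale,
# so exponentiate to get back to livres
new_widths <- tibble(Width_in = c(20, 21))
exp(predict(m_lprice_wt, newdata = new_widths))
```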
---

## Recap

- Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.

--

- The most common transformation is the log transformation: `\(log(y)\)`, especially useful when the response variable is (extremely) right skewed.

--

- This transformation is also useful for variance stabilization.

--

- When using a log transformation on the response variable, the interpretation of the slope changes: *"For each unit increase in x, y is expected to multiply by a factor of `\(e^{b_1}\)`."*

--

- Another useful transformation is the square root: `\(\sqrt{y}\)`, especially useful when the response variable is counts.

---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear

- However, in those cases a polynomial regression may be more appropriate
    + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance

---

## Aside: when `\(y = 0\)`

In some cases the value of the response variable might be 0, and

```r
log(0)
```

```
## [1] -Inf
```

--

The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied:

```r
log(0 + 0.00001)
```

```
## [1] -11.51293
```

---

class: center, middle

# The linear model with multiple predictors

---

### Price, surface area, and living artist

.question[
What is the typical surface area for paintings?
]

<!-- -->

--

Less than 1000 square inches (which is roughly a painting that is 31in x 31in). There are very few paintings that have a surface area above 5000 square inches.

---

## Price, surface area, and living artist

For simplicity let's focus on the paintings with `Surface < 5000`:

<!-- -->

---

## Price vs. surface and living artist

.question[
Does the relationship between surface and logged price vary by whether or not the artist is living?
]

.small[

```r
ggplot(data = pp_Surf_lt_5000,
       mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "Living artist")
```

<!-- -->
]

---

## Modeling with main effects

```r
m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000)

m_main %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 5))
```

```
## # A tibble: 3 x 2
##   term                  estimate
##   <chr>                    <dbl>
## 1 (Intercept)            4.88   
## 2 Surface                0.00027
## 3 factor(artistliving)1  0.137
```

--

Linear model:

$$ \widehat{log(price)} = 4.88 + 0.00027~surface + 0.14~artistliving $$

--

- Plug in 0 for `artistliving` to get the linear model for paintings by non-living artists.

- Plug in 1 for `artistliving` to get the linear model for paintings by living artists.

---

## Interpretation of main effects

.pull-left[
<!-- -->
]

.pull-right[
- Non-living artist:
`\(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 0\)`
`\(= 4.88 + 0.00027~surface\)`

- Living artist:
`\(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 1\)`
`\(= 5.02 + 0.00027~surface\)`
]

- The rate of change in price as the surface area of the painting increases does not vary between paintings by living and non-living artists (same slope).

- Paintings by living artists are consistently more expensive than paintings by non-living artists (different intercept).
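---

## Main effects: predicted prices

A minimal sketch (using an illustrative `new_paintings` data frame and the `m_main` model from the previous slides): back-transformed predictions for a 1000-square-inch painting by a non-living vs. a living artist should differ by a factor of roughly `\(e^{0.14} \approx 1.15\)`.

```r
# illustrative painting of 1000 square inches, by a non-living (0) and
# a living (1) artist; predictions are on the log scale, so exponentiate
new_paintings <- tibble(
  Surface      = c(1000, 1000),
  artistliving = c(0, 1)
)
exp(predict(m_main, newdata = new_paintings))
```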