class: center, middle, inverse, title-slide

# Formalizing Linear Models
## Intro to Data Science
### Shawn Santo
### 02-20-20

---

class: center, middle, inverse

# Recall

---

## Vocabulary

- **Response** variable: variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

--

- **Explanatory** variables: other variables that you want to use to explain the variation in the response, on the x-axis (independent variables); these are also referred to as predictors or features

--

- **Predicted** value: output of the **model function**
    - The model function gives the typical value of the response variable *conditioned* on the explanatory variables (what does this mean?)

--

- **Residuals**: show how far each case is from its model value
    - **Residual = Observed value - Predicted value**
    - Tells how far above/below the model function each case is

---

## How do we use models?

1. **Explanation**: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables.

2. **Prediction**: Plug in `\(x\)`, get the predicted `\(y\)`.

---

## Least squares regression

The regression line minimizes the sum of squared residuals. We consider this to be the "best" line.

--

- Residuals: `\(e_i = y_i - \hat{y}_i\)`
- The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.
- Equivalently, it minimizes `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)`.

---

## Want to follow along?

Create a private repo: https://classroom.github.com/a/mlp19i6c

---

class: center, middle, inverse

# Characterizing relationships with models

---

## Data and packages

```r
library(tidyverse)
library(broom)
```

```r
paris_paint <- read_csv("data/paris_paintings.csv",
                        na = c("n/a", "", "NA"))
```

<br/>

- [Paris Paintings Codebook](http://www2.stat.duke.edu/courses/Spring20/sta199.001/data/code_books/paris_codebook.html)
- Source: printed catalogues of 28 auction sales in Paris, 1764-1780
- 3,393 paintings, their prices, and descriptive details from sales catalogues; over 60 variables

---

## Models

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = paris_paint)
tidy(m_ht_wt)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    3.62    0.254        14.3 8.82e-45
#> 2 Width_in       0.781   0.00950      82.1 0.
```

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

```r
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = paris_paint)
tidy(m_ht_lands)
```

```
#> # A tibble: 2 x 5
#>   term              estimate std.error statistic  p.value
#>   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)          22.7      0.328      69.1 0.
#> 2 factor(landsALL)1    -5.65     0.532     -10.6 7.97e-26
```

`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`

---

## Models

```r
paris_paint %>%
  mutate(
    school_pntg = factor(school_pntg),
    school_pntg = fct_relevel(school_pntg, "X")
  ) %>%
  lm(Height_in ~ school_pntg, data = .) %>%
  tidy()
```

```
#> # A tibble: 7 x 5
#>   term            estimate std.error statistic  p.value
#>   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)       16.9       2.24      7.53  6.65e-14
#> 2 school_pntgA      -2.87     10.3      -0.279 7.80e- 1
#> 3 school_pntgD/FL   -0.540     2.27     -0.238 8.12e- 1
#> 4 school_pntgF       7.33      2.28      3.22  1.30e- 3
#> 5 school_pntgG      -1.22      6.72     -0.181 8.56e- 1
#> 6 school_pntgI       7.42      2.35      3.15  1.62e- 3
#> 7 school_pntgS      27.6       5.81      4.75  2.16e- 6
```

`$$\widehat{Height_{in}} = 16.9 - 2.87~sch_A - 0.54~sch_D + 7.33~sch_F - 1.22~sch_G + 7.42~sch_I + 27.6~sch_S$$`

---

class: center, middle, inverse

# Prediction with models

---

## Predict height from width

On average, how tall are paintings that are 60 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

```r
3.62 + 0.78 * 60
```

```
#> [1] 50.42
```

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)
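---

## Prediction in R

Rather than plugging in coefficients by hand, we can let R do the arithmetic. A minimal sketch using base R's `predict()` with the model fit above (the width of 60 matches the previous slide):

```r
# build a one-row data frame with the explanatory variable
new_painting <- tibble(Width_in = 60)

# predicted height (in inches) for a 60-inch-wide painting
predict(m_ht_wt, newdata = new_painting)
```

Note that `predict()` uses the coefficients at full precision, so the result will differ slightly from the hand calculation with rounded values.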
---

## Prediction vs. extrapolation

On average, how tall are paintings that are 400 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Measuring model fit

---

## Measuring the strength of the fit

- The strength of the fit of a linear model is most commonly evaluated using `\(R^2\)`.
- It tells us what percent of variability in the response variable is explained by the model.
- The remainder of the variability is explained by variables not included in the model.
- `\(R^2\)` is sometimes called the coefficient of determination.

---

## Obtaining `\(R^2\)` with `glance()`

Height vs. width

```r
glance(m_ht_wt)
```

```
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
#> 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
#> # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```

```r
m_ht_wt %>%
  glance() %>%
  pull(r.squared)
```

```
#> [1] 0.6829468
```

Roughly 68% of the variability in heights of paintings can be explained by their widths.

---

## Obtaining `\(R^2\)`

Height vs. landscape features

```r
m_ht_lands %>%
  glance() %>%
  pull(r.squared)
```

```
#> [1] 0.03456724
```
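---

## Where does `\(R^2\)` come from?

A minimal sketch of the definition `\(R^2 = 1 - SSE/SST\)`, computed directly from the residuals. It uses `augment()` from the broom package, which is introduced later in these slides:

```r
m_ht_wt %>%
  augment() %>%
  summarise(
    # 1 - (sum of squared residuals) / (total sum of squares)
    r_squared = 1 - sum(.resid^2) / sum((Height_in - mean(Height_in))^2)
  )
```

The result should match the `r.squared` value that `glance()` reports for this model.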
---

class: center, middle, inverse

# Tidy regression output

---

## Tidy regression output

Let's revisit the model predicting heights of paintings from their widths.

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = paris_paint)
```

---

## Not-so-tidy regression output

- You might come across these in your googling adventures, but we'll try to stay away from them
- Not because they are wrong, but because they don't return their results as tidy data frames

---

## Not-so-tidy regression output 1

Option 1:

```r
m_ht_wt
```

```
#> 
#> Call:
#> lm(formula = Height_in ~ Width_in, data = paris_paint)
#> 
#> Coefficients:
#> (Intercept)     Width_in  
#>      3.6214       0.7808
```

---

## Not-so-tidy regression output 2

Option 2:

```r
summary(m_ht_wt)
```

```
#> 
#> Call:
#> lm(formula = Height_in ~ Width_in, data = paris_paint)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -86.714  -4.384  -2.422   3.169  85.084 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 3.621406   0.253860   14.27   <2e-16 ***
#> Width_in    0.780796   0.009505   82.15   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.304 on 3133 degrees of freedom
#>   (258 observations deleted due to missingness)
#> Multiple R-squared:  0.6829, Adjusted R-squared:  0.6828 
#> F-statistic:  6749 on 1 and 3133 DF,  p-value: < 2.2e-16
```

---

## Recall

What makes a data frame tidy?

--

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

---

## Tidy regression output

Achieved with functions from the broom package:

- `tidy()`: constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*.
- `augment()`: adds columns to the original data that was modeled. This includes predictions and residuals.
- `glance()`: constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard error that are computed once for the entire model*.

---

## Tidy your model's statistical findings

```r
tidy(m_ht_wt)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    3.62    0.254        14.3 8.82e-45
#> 2 Width_in       0.781   0.00950      82.1 0.
```

--

```r
tidy(m_ht_wt) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)    3.62 
#> 2 Width_in       0.781
```

---

## Augment data with model results

New variables of note (for now):

- **`.fitted`**: predicted value of the response variable
- **`.resid`**: residuals

```r
augment(m_ht_wt) %>%
  slice(1:5)
```

```
#> # A tibble: 5 x 10
#>   .rownames Height_in Width_in .fitted .se.fit .resid    .hat .sigma
#>   <chr>         <dbl>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#> 1 1                37     29.5    26.7   0.166  10.3  3.99e-4   8.30
#> 2 2                18     14      14.6   0.165   3.45 3.96e-4   8.31
#> 3 3                13     16      16.1   0.158  -3.11 3.61e-4   8.31
#> 4 4                14     18      17.7   0.152  -3.68 3.37e-4   8.31
#> 5 5                14     18      17.7   0.152  -3.68 3.37e-4   8.31
#> # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
```

--

Why might we be interested in these new variables?

---

## Residuals plot

.tiny[

```r
m_ht_wt_aug <- augment(m_ht_wt)

ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "blue", lty = 2) +
  labs(x = "Predicted height", y = "Residuals") +
  theme_minimal()
```

<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

]

--

What does this plot tell us about the fit of the linear model?

---

## Glance to assess model fit

```r
glance(m_ht_wt)
```

```
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>
#> 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173.
#> # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
```

--

```r
m_ht_wt %>%
  glance() %>%
  pull(r.squared)
```

```
#> [1] 0.6829468
```

--

The `\(R^2\)` is 68.29%.

---

class: center, middle, inverse

# Exploring linearity

---

## Data: Paris paintings

<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Price vs. width

**Describe the relationship between price and width of painting.**

<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Let's focus on paintings with `Width_in < 100`

```r
paris_paint_wt_lt_100 <- paris_paint %>%
  filter(Width_in < 100)
```

---

## Price vs. width

Which plot shows a more linear relationship?

.small[
.pull-left[
<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]
]

---

## Price vs. width, residuals

Which plot shows residuals that are uncorrelated with the predicted values from the model?

.small[
.pull-left[
<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]
]

--

<br/>

**What's the unit of residuals?**

---

## Transforming the data

- We saw that `price` has a right-skewed distribution, and the relationship between price and width of painting is non-linear.

--

- In these situations a transformation applied to the response variable may be useful.

--

- In order to decide which transformation to use, we should examine the distribution of the response variable, as in the sketch on the next slide.

--

- The extremely right-skewed distribution suggests that a log transformation may be useful.
    - The default base of the `log` function in R is the natural log: <br> `log(x, base = exp(1))`
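---

## Examining the distribution of `price`

A minimal sketch of one way to examine the response's distribution, comparing `price` with `log(price)` (the binwidths are arbitrary choices):

```r
# raw prices: expect a strong right skew
ggplot(paris_paint_wt_lt_100, aes(x = price)) +
  geom_histogram(binwidth = 500)

# log prices: expect a much more symmetric shape
ggplot(paris_paint_wt_lt_100, aes(x = log(price))) +
  geom_histogram(binwidth = 0.5)
```

If the first histogram is strongly right skewed and the second looks roughly symmetric, that supports using the log transformation.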
---

## Log price vs. width

**How do we interpret the slope of this model?**

<img src="lec07b-formalizing-linear-models_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Interpreting models with log transformation

```r
*m_lprice_wt <- lm(log(price) ~ Width_in, data = paris_paint_wt_lt_100)
m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)    4.67 
#> 2 Width_in       0.019
```

---

## Linear model with log transformation

$$ \widehat{\log(price)} = 4.67 + 0.02~Width $$

--

- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 log livres.

--

- which is not a very useful statement...

---

## Working with logs

- Subtraction and logs: `\(\log(a) - \log(b) = \log(a / b)\)`

--

- Natural logarithm: `\(e^{\log(x)} = x\)`

--

- We can use these identities to "undo" the log transformation

---

## Interpreting models with log transformation

The slope coefficient for the log-transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.

--

$$ \log(\text{price for width } x+1) - \log(\text{price for width } x) = 0.02 $$

--

$$ \log\left(\frac{\text{price for width } x+1}{\text{price for width } x}\right) = 0.02 $$

--

$$ e^{\log\left(\frac{\text{price for width } x+1}{\text{price for width } x}\right)} = e^{0.02} $$

--

$$ \frac{\text{price for width } x+1}{\text{price for width } x} \approx 1.02 $$

--

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
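---

## Checking the factor interpretation

A quick numerical check of the derivation, using `predict()` with the fitted model above (the widths 20 and 21 are arbitrary adjacent values):

```r
# predicted log prices at widths 20 and 21
p <- predict(m_lprice_wt, newdata = tibble(Width_in = c(20, 21)))

# ratio of predicted prices
exp(p[2] - p[1])
```

Because the model is linear in `log(price)`, this ratio equals `exp()` of the slope, about 1.02, regardless of which adjacent widths we pick.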
---

## Shortcuts in R

```r
m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)    4.67 
#> 2 Width_in       0.019
```

--

```r
m_lprice_wt %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(exp(estimate), 3))
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)   107.  
#> 2 Width_in        1.02
```

---

## Recap

- The most common transformation for a right-skewed response variable is the log transform: `\(\log(y)\)`. It is especially useful when the response variable is extremely right skewed.

--

- This transformation is also useful for variance stabilization.

--

- When using a log transformation on the response variable, the interpretation of the slope changes: *"For each unit increase in x, y is expected to multiply by a factor of `\(e^{b_1}\)`."*

--

- Another useful transformation is the square root: `\(\sqrt{y}\)`, especially useful when the response variable is counts.

---

## Transform, or learn more?

- Data transformations may also be useful when the relationship is non-linear
- However, in those cases a polynomial regression may be more appropriate
    + This is beyond the scope of this course, but you're welcome to try it for your final project, and I'd be happy to provide further guidance

---

## Aside: when `\(y = 0\)`

In some cases the value of the response variable might be 0, and

```r
log(0)
```

```
#> [1] -Inf
```

--

The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied:

```r
log(0 + 0.00001)
```

```
#> [1] -11.51293
```
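--

An alternative worth knowing (not covered further here): base R's `log1p()` computes `\(\log(1 + x)\)`, so it handles zeros without an arbitrary offset, though it changes the interpretation of the coefficients:

```r
# log1p(x) is log(1 + x), so a zero response maps to 0 rather than -Inf
log1p(0)
```

```
#> [1] 0
```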