Multiple Linear Regression

# Multiple Linear Regression
## Interactions & Transformations
### Prof. Maria Tackett
### 02.17.20

---

### [Click for PDF of slides](09-transformations.pdf)

---

### Announcements

- Team Feedback #1 **due Wed, Feb 19 at 11:59p**
    - Check for an email from Teammates
    - Please provide honest and constructive feedback. This team feedback will be graded for completion.

- HW 03 **due Mon, Feb 24 at 11:59p**

---

### Today's Agenda

- Interactions

- Log Transformations

---

## Interactions

---

### Interaction Terms

- **Case**: The relationship between a predictor variable and the response depends on the value of another predictor variable
  + This is an .vocab[interaction effect]
  
- Create a new interaction variable that is the product of the two predictor variables in the interaction

-  **Good Practice**: When including an interaction term, also *include the associated <u>main effects</u>* (each predictor variable on its own) even if their coefficients are not statistically significant
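
As a minimal sketch of these points, the built-in `mtcars` data can stand in for an example (it is not the loans data used in these slides, and the object names below are only illustrative). In an R model formula, `x1 * x2` expands to both main effects plus the product term `x1:x2`, so the main effects are included automatically.

```r
# Illustration only: mtcars is a stand-in dataset, not the loans data
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# wt * am expands to wt + am + wt:am (main effects + interaction)
int_model <- lm(mpg ~ wt * am, data = cars)
names(coef(int_model))

# Equivalent fit with the main effects and interaction written out
explicit_model <- lm(mpg ~ wt + am + wt:am, data = cars)
all.equal(coef(int_model), coef(explicit_model))
```

Writing the formula with `*` is a convenient way to follow the good practice above, since it is hard to drop a main effect by accident.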

---

### Checking for interactions in the EDA

---

### The data

**Predictors**
- .vocab[`verified_income`]: Whether borrower's income source and amount have been verified (`Not Verified`, `Source Verified`, `Verified`)
- .vocab[`debt_to_income`]: Debt-to-income ratio, i.e. a borrower's total debt divided by their total income, expressed as a percentage
- .vocab[`bankruptcy`]: Indicator of whether borrower has had a bankruptcy in the past (`0`: No, `1`: Yes)
- .vocab[`term`]: Length of the loan in months
- .vocab[`credit_util`]: What fraction of total credit a borrower is utilizing, i.e. total credit utilized divided by total credit limit

**Response**
- .vocab[`interest_rate`]: Interest rate for the loan

```
## Observations: 9,974
## Variables: 9
## $ verified_income  <chr> "Verified", "Not Verified", "Source Verified", "Not …
## $ debt_to_income   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19…
## $ bankruptcy       <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0…
## $ term             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, …
## $ credit_util      <dbl> 0.54759517, 0.15003472, 0.66134832, 0.19673228, 0.75…
## $ interest_rate    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99…
## $ debt_inc_cent    <dbl> -1.3019882, -14.2719882, 1.8380118, -9.1519882, 38.6…
## $ term_cent        <dbl> 16.725887, -7.274113, -7.274113, -7.274113, -7.27411…
## $ credit_util_cent <dbl> 0.14448914, -0.25307131, 0.25824229, -0.20637375, 0.…
```
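
The `_cent` columns in the output above are consistent with mean-centering: for example, 18.01 − (−1.3019882) and 5.04 − (−14.2719882) both give 19.3119882, which suggests that the mean of `debt_to_income` is about 19.31. The sketch below shows the idea on the first few displayed values only. Because it centers by the mean of that small subset, the results will not match the `debt_inc_cent` values shown in the output.

```r
# First few debt_to_income values from the output above (toy subset)
debt_to_income <- c(18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19)

# Mean-centering: subtract the mean so that 0 corresponds to the average
debt_inc_cent <- debt_to_income - mean(debt_to_income)

mean(debt_inc_cent)  # zero up to floating-point error
```

With centered predictors, the intercept describes a borrower with average values of those predictors rather than one with a value of 0.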

---

### Add interaction term

```r
model_w_int <- lm(interest_rate ~ verified_income + debt_inc_cent + 
                   bankruptcy + term_cent + credit_util_cent +
                   debt_inc_cent * verified_income,
                 data = loans)
```

|term                                         | estimate| std.error| statistic| p.value|
|:--------------------------------------------|--------:|---------:|---------:|-------:|
|(Intercept)                                  |   11.298|     0.074|   151.764|   0.000|
|verified_incomeSource Verified               |    1.094|     0.100|    10.940|   0.000|
|verified_incomeVerified                      |    2.704|     0.119|    22.730|   0.000|
|debt_inc_cent                                |    0.032|     0.005|     6.527|   0.000|
|bankruptcy1                                  |    0.525|     0.133|     3.954|   0.000|
|term_cent                                    |    0.154|     0.004|    38.764|   0.000|
|credit_util_cent                             |    4.841|     0.163|    29.689|   0.000|
|verified_incomeSource Verified:debt_inc_cent |   -0.009|     0.007|    -1.243|   0.214|
|verified_incomeVerified:debt_inc_cent        |   -0.019|     0.007|    -2.699|   0.007|

---

### Understanding interactions

- **Different intercept**: `verified_incomeVerified` = 2.704

- **Different slope**: `verified_incomeVerified:debt_inc_cent` = -0.019
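
To make the two bullets concrete, the arithmetic can be sketched using the estimates from the model output. The variable names are only illustrative. "Intercept" here means the expected interest rate for a borrower with no bankruptcy and all centered predictors equal to 0, and `Not Verified` is the baseline level because it has no coefficient of its own.

```r
# Estimates copied from the model output above
b_intercept     <- 11.298
b_verified      <- 2.704   # verified_incomeVerified
b_debt          <- 0.032   # debt_inc_cent
b_verified_debt <- -0.019  # verified_incomeVerified:debt_inc_cent

# Not Verified (baseline): intercept and slope of debt_inc_cent
not_verified <- c(intercept = b_intercept, slope = b_debt)

# Verified: add the level's coefficients to the baseline values
verified <- c(intercept = b_intercept + b_verified,
              slope     = b_debt + b_verified_debt)

rbind(not_verified, verified)
```

For borrowers with verified income, the line relating interest rate to debt-to-income ratio therefore starts higher (14.002 vs. 11.298) but is flatter (0.013 vs. 0.032).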

---

## Log Transformations

---

## Respiratory Rate vs. Age

- A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a "high" rate, we first want to understand the relationship between a child's age and their respiratory rate.

- The data contain the respiratory rate for 618 children ages 15 days to 3 years.

- **Variables**: 
    - <font class="vocab">`Age`</font>: age in months
    - <font class="vocab">`Rate`</font>: respiratory rate (breaths per minute)

---

## Rate vs. Age

```r
library(Sleuth3)  # assumed source of the ex0824 data
library(ggplot2)

respiratory <- ex0824
ggplot(data = respiratory, aes(x = Age, y = Rate)) +
  geom_point() +
  labs(title = "Respiratory Rate vs. Age")
```

---

## Rate vs. Age

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |   47.052|     0.504|    93.317|       0|   46.062|    48.042|
|Age         |   -0.696|     0.029|   -23.684|       0|   -0.753|    -0.638|

---

## Log transformations

---

### Need to transform `$y$`

- Typically, a "fan-shaped" residual plot indicates the need for a transformation of the response variable `$y$`
  + `$\mathbf{\color{green}{\log(y)}}$`: Easiest to interpret

- When building a model: 
  + Choose a transformation and build the model on the transformed data
  + Reassess the residual plots
  + If the residual plots did not sufficiently improve, try a new transformation!
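
The workflow above can be sketched on simulated data. Everything here is an assumption for illustration: the data are generated with multiplicative errors so that the original-scale residuals fan out, and the object names are hypothetical.

```r
set.seed(210)  # arbitrary seed, for reproducibility

# Simulated data with multiplicative errors (an assumption for illustration)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.3))

# Step 1: fit on the original scale and inspect residuals vs. fitted values
raw_model <- lm(y ~ x)
plot(fitted(raw_model), resid(raw_model), main = "Response: y")

# Step 2: refit with log(y) and reassess the residual plot
log_model_sim <- lm(log(y) ~ x)
plot(fitted(log_model_sim), resid(log_model_sim), main = "Response: log(y)")

coef(log_model_sim)
```

If the second residual plot still showed a clear pattern, the next step would be to try a different transformation and repeat.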

---

### Log transformation on `$y$`

- Use when the residual plot shows a "fan-shaped" pattern

- If we apply a log transformation to the response variable, we want to estimate the parameters for the model...
.alert[
`$$\log(y) = \beta_0 + \beta_1 x$$`
]

- We want to interpret the model in terms of `$y$` not `$\log(y)$`, so we write all interpretations in terms of
.alert[
`$$y = \exp\{\beta_0 + \beta_1 x\} = \exp\{\beta_0\}\exp\{\beta_1 x\}$$`
]

---

### Mean and logs

Suppose we have a set of values

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

Let's find the mean of the logged values of x, i.e. `$\overline{\log(x)}$`

```r
log_x <- log(x)
mean(log_x)
```

```
## [1] 2.066476
```

Let's find the mean of x and then log the mean value, i.e. `$\log(\bar{x})$`

```r
xbar <- mean(x)
log(xbar)
```

```
## [1] 2.228477
```

---

### Median and logs

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

Let's find the median of the logged values of x, i.e. `$\text{Median}(\log(x))$`

```r
log_x <- log(x)
median(log_x)
```

```
## [1] 2.079442
```

Let's find the median of x and then log the median value, i.e. `$\log(\text{Median}(x))$`

```r
median_x <- median(x)
log(median_x)
```

```
## [1] 2.079442
```

---

### Mean, Median, and log

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

`$$\overline{\log(x)} \neq \log(\bar{x})$$`

```r
mean(log_x) == log(xbar)
```

```
## [1] FALSE
```

`$$\text{Median}(\log(x)) = \log(\text{Median}(x))$$`

```r
median(log_x) == log(median_x)
```

```
## [1] TRUE
```
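
One caveat worth sketching: the exact equality above relies on the median being one of the observed values, which is the case when the number of values is odd. With an even number of values, `median()` averages the two middle values, so the two quantities are close but not identical. The example also uses `all.equal()` rather than `==`, which is generally the safer way to compare floating-point numbers. The vectors `x_odd` and `x_even` are hypothetical names.

```r
# Odd number of values: the median is a single observed value
x_odd <- c(3, 5, 6, 8, 10, 14, 19)
isTRUE(all.equal(median(log(x_odd)), log(median(x_odd))))

# Even number of values: median() averages the two middle values,
# so the two quantities are close but no longer identical
x_even <- c(3, 5, 6, 8)
median(log(x_even))  # (log(5) + log(6)) / 2
log(median(x_even))  # log(5.5)
isTRUE(all.equal(median(log(x_even)), log(median(x_even))))
```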

---

### Mean and median of `$\log(y)$`

- Recall that `$y = \beta_0 + \beta_1 x_i$` is the **mean** value of `$y$` at the given value `$x_i$`. This doesn't hold when we log-transform `$y$`.

- The mean of the logged values is **not** equal to the log of the mean value. Therefore, at a given value of `$x$`,

.alert[
`$$\begin{aligned}\exp\{\text{Mean}(\log(y))\} \neq \text{Mean}(y) \\[5pt]
\Rightarrow \exp\{\beta_0 + \beta_1 x\} \neq \text{Mean}(y) \end{aligned}$$`
]
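
A small simulation can illustrate the inequality. It assumes, purely for illustration, that at a single fixed value of `$x$` the logged response is normally distributed with mean 2 and standard deviation 1. The names `log_y` and `y` are hypothetical.

```r
set.seed(210)  # arbitrary seed, for reproducibility

# Simulate y at a single value of x, assuming log(y) is normal with mean 2
log_y <- rnorm(100000, mean = 2, sd = 1)
y <- exp(log_y)

exp(mean(log_y))  # close to exp(2), about 7.39
mean(y)           # noticeably larger than exp(mean(log_y))
```

Back-transforming the mean of the logged values therefore does not recover the mean of `$y$`.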

---

### Mean and median of `$\log(y)$`

- However, the median of the logged values **is** equal to the log of the median value. Therefore,
.alert[
`$$\exp\{\text{Median}(\log(y))\} = \text{Median}(y)$$`
]

- If the distribution of `$\log(y)$` is symmetric about the regression line, for a given value `$x_i$`,
.alert[
`$$\text{Median}(\log(y)) = \text{Mean}(\log(y))$$`
]
---

### Interpretation with log-transformed `$y$`

- Given the previous facts, if `$\log(y) = \beta_0 + \beta_1 x$`, then 
.alert[
`$$\text{Median}(y) = \exp\{\beta_0\}\exp\{\beta_1 x\}$$`
]

- <font class="vocab">Intercept:</font> When `$x=0$`, the median of `$y$` is expected to be `$\exp\{\beta_0\}$`
<br>

- <font class="vocab">Slope: </font>For every one unit increase in `$x$`, the median of `$y$` is expected to multiply by a factor of `$\exp\{\beta_1\}$`
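
The slope interpretation can be checked numerically. The sketch below uses simulated data in which `$\log(y)$` is linear in `$x$`. The coefficients, sample size, and object names are assumptions chosen for illustration.

```r
set.seed(210)  # arbitrary seed, for reproducibility

# Simulated data where log(y) is linear in x (an assumption for illustration)
x <- runif(500, min = 0, max = 10)
y <- exp(0.5 + 0.2 * x + rnorm(500, sd = 0.4))
fit <- lm(log(y) ~ x)

# Estimated median of y at x = 3 and x = 4, back-transformed with exp()
new_x <- data.frame(x = c(3, 4))
median_y <- exp(predict(fit, newdata = new_x))

# The ratio for a one-unit increase in x equals exp(slope estimate)
ratio <- unname(median_y[2] / median_y[1])
ratio
exp(coef(fit)[["x"]])
```

The same ratio is obtained for any pair of `$x$` values one unit apart, which is why the slope is read as a multiplicative change.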

---

### log(Rate) vs. Age

```r
respiratory <- respiratory %>% mutate(log_rate = log(Rate))
```

---

### log(Rate) vs. Age

```r
log_model <- lm(log_rate ~ Age, data = respiratory)
```

---

### log(Rate) vs. Age

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |    3.845|     0.013|   304.500|       0|     3.82|     3.870|
|Age         |   -0.019|     0.001|   -25.839|       0|    -0.02|    -0.018|
<br>
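
As a worked example of the interpretation, the estimates in the table can be back-transformed with `exp()`. The names `b0` and `b1` are just labels for the table values, which are rounded, so the results are approximate.

```r
# Estimates copied from the log(Rate) ~ Age output above
b0 <- 3.845
b1 <- -0.019

exp(b0)              # estimated median Rate at Age = 0, about 46.8
exp(b1)              # multiplicative change in median Rate per month, about 0.981
(exp(b1) - 1) * 100  # about -1.9, i.e. roughly a 1.9% decrease per month
```

In words: for each additional month of age, the median respiratory rate is expected to multiply by a factor of about 0.981.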


---

### Confidence interval for `$\beta_j$`

- The confidence interval for the coefficient of `$x$` describing its relationship with `$\log(y)$` is

`$$\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)$$`

- The confidence interval for the coefficient of `$x$` describing its relationship with `$y$` is

`$$\exp\{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\}$$`

---

### Coefficient of `Age`

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |    3.845|     0.013|   304.500|       0|     3.82|     3.870|
|Age         |   -0.019|     0.001|   -25.839|       0|    -0.02|    -0.018|
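
Because `exp()` is an increasing function, exponentiating the endpoints of the interval for the `Age` coefficient gives an interval for the multiplicative change in the median of `Rate`. The sketch below uses the rounded bounds from the table, so the result is approximate. The confidence level is not stated in the output, and the object names are hypothetical.

```r
# Confidence bounds for the Age coefficient, copied from the table above
ci_log_scale <- c(lower = -0.02, upper = -0.018)

# Exponentiate the endpoints to get an interval for the multiplicative
# change in the median of Rate per one-month increase in Age
ci_y_scale <- exp(ci_log_scale)
ci_y_scale
```

The interval runs from roughly 0.980 to 0.982, so the median respiratory rate is estimated to multiply by a factor in that range for each additional month of age.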

---

### Log Transformation on `$x$`

.pull-left[
<img src="09-transformations_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="09-transformations_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

- Try a transformation on `$x$` if the scatterplot shows some curvature but the variance is constant for all values of `$x$`

---

### Model with Transformation on `$x$`

`$$y = \beta_0 + \beta_1 \log(x)$$`

- <font class="vocab">Intercept: </font> When `$\log(x)=0$`, `$(x=1)$`, `$y$` is expected to be `$\beta_0$` (i.e. the mean of `$y$` is `$\beta_0$`)

- <font class="vocab">Slope: </font> When `$x$` is multiplied by a factor of `$\mathbf{C}$`, `$y$` is expected to change by `$\boldsymbol{\beta_1}\mathbf{\log(C)}$` units, i.e. the mean of `$y$` changes by `$\boldsymbol{\beta_1}\mathbf{\log(C)}$`
    - *Example*: when `$x$` is multiplied by a factor of 2, `$y$` is expected to change by `$\boldsymbol{\beta_1}\mathbf{\log(2)}$` units
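
The slope interpretation can be verified numerically. The sketch below simulates data in which `$y$` is linear in `$\log(x)$`. The coefficients, sample size, and object names are assumptions for illustration. `log()` in R is the natural log.

```r
set.seed(210)  # arbitrary seed, for reproducibility

# Simulated data where y is linear in log(x) (an assumption for illustration)
x <- runif(300, min = 1, max = 50)
y <- 10 + 3 * log(x) + rnorm(300, sd = 1)
fit <- lm(y ~ log(x))

# Predicted mean of y at x = 5 and at x = 10 (x multiplied by C = 2)
pred <- predict(fit, newdata = data.frame(x = c(5, 10)))

change <- unname(pred[2] - pred[1])  # change in the predicted mean of y
change
coef(fit)[["log(x)"]] * log(2)       # slope estimate times log(C)
```

The two printed values agree, and the same change is obtained for any doubling of `$x$`, for example from 20 to 40.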

---

### Rate vs. log(Age)

---

### Rate vs. log(Age)

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |   50.135|     0.632|    79.330|       0|   48.893|    51.376|
|log_age     |   -5.982|     0.263|   -22.781|       0|   -6.498|    -5.467|
<br>
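
As a worked example, the estimates in the table can be plugged into the interpretations for a model with a log-transformed predictor. This assumes `log_age` is the natural log of `Age` in months, which is not shown in the output. The names `b0` and `b1` are just labels for the rounded table values.

```r
# Estimates copied from the Rate ~ log(Age) output above
b0 <- 50.135
b1 <- -5.982

b0             # expected Rate when Age = 1 month, since log(1) = 0
b1 * log(2)    # expected change in Rate when Age doubles, about -4.15
b1 * log(1.1)  # expected change in Rate for a 10% increase in Age, about -0.57
```

In words: when a child's age doubles, the respiratory rate is expected to decrease by about 4.15 breaths per minute.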


---

See [Log Transformations in Linear Regression](https://github.com/sta210-sp20/supplemental-notes/blob/master/log-transformations.pdf) for more details about interpreting regression models with log-transformed variables.