# Transformations & Model Assessment
### Dr. Maria Tackett
### 02.11.19

---

## Announcements

- HW 02 due today

- Lab 04 due Wednesday

- HW 03 due Monday, Feb 18

---

## R packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(cowplot) # use plot_grid function
library(Sleuth3) #ex0824 data
```

---

## Respiratory Rate vs. Age

- A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a "high" rate, we first want to understand the relationship between a child's age and their respiratory rate.

- The data contain the respiratory rate for 618 children ages 15 days to 3 years.

- **Variables**: 
    - <font class="vocab">`Age`</font>: age in months
    - <font class="vocab">`Rate`</font>: respiratory rate (breaths per minute)

---

## Rate vs. Age

```r
respiratory <- ex0824
ggplot(data=respiratory, aes(x=Age, y=Rate)) +
  geom_point() + 
  labs(title = "Respiratory Rate vs. Age")
```

---

## Rate vs. Age

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 47.052 </td>
   <td style="text-align:right;"> 0.504 </td>
   <td style="text-align:right;"> 93.317 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 46.062 </td>
   <td style="text-align:right;"> 48.042 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> -0.696 </td>
   <td style="text-align:right;"> 0.029 </td>
   <td style="text-align:right;"> -23.684 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -0.753 </td>
   <td style="text-align:right;"> -0.638 </td>
  </tr>
</tbody>
</table>

---

## Need to transform `$y$`

- Typically, a "fan-shaped" residual plot indicates the need for a transformation of the response variable `$y$`
  + `$\mathbf{\color{green}{\log(y)}}$`: Easiest to interpret

- When building a model: 
  + Choose a transformation and build the model on the transformed data
  + Reassess the residual plots
  + If the residual plots do not sufficiently improve, try a new transformation!
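
The build-and-reassess loop above can be sketched in base R. The data here are simulated (an assumption for illustration, not the respiratory data), with noise that is multiplicative on the original scale so that the raw residuals fan out. The `spread_ratio()` helper is a hypothetical, crude numeric stand-in for eyeballing the residual plot.

```r
# Simulated data (hypothetical): multiplicative noise produces a fan-shaped residual plot
set.seed(210)
n <- 500
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))

fit_raw <- lm(y ~ x)        # model on the original scale
fit_log <- lm(log(y) ~ x)   # model on the log-transformed response

# Crude numeric check of the fan shape:
# ratio of residual SD for large x to residual SD for small x
spread_ratio <- function(model, x) {
  upper <- x > median(x)
  sd(resid(model)[upper]) / sd(resid(model)[!upper])
}
ratio_raw <- spread_ratio(fit_raw, x)
ratio_log <- spread_ratio(fit_log, x)
c(raw = ratio_raw, log = ratio_log)   # a ratio near 1 suggests roughly constant spread

# Reassess the residual plots side by side
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "Response: y")
plot(fitted(fit_log), resid(fit_log), main = "Response: log(y)")
```

If the ratio for the transformed model were still far from 1, that would be the cue to try a different transformation.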

---

## Log transformation on `$y$`

- Use when the residual plot shows a "fan-shaped" pattern

- If we apply a log transformation to the response variable, we want to estimate the parameters for the model...
.alert[
`$$\log(y) = \beta_0 + \beta_1 x$$`
]

- We want to interpret the model in terms of `$y$`, not `$\log(y)$`, so we write all interpretations in terms of `$y = \exp\{\beta_0 + \beta_1 x\} = \exp\{\beta_0\}\exp\{\beta_1 x\}$`

### Mean and median of `$\log(y)$`

- Recall that in the untransformed model, `$\beta_0 + \beta_1 x_i$` is the **mean** value of `$y$` at the given value `$x_i$`. This doesn't hold when we log-transform `$y$`.

- The mean of the logged values is **not** equal to the log of the mean value. Therefore, at a given value of `$x$`,

`$$\color{blue}{\begin{aligned}\exp\{\text{Mean}(\log(y))\} \neq \text{Mean}(y) \\[5pt]
\Rightarrow \exp\{\beta_0 + \beta_1 x\} \neq \text{Mean}(y) \end{aligned}}$$`

- However, the median of the logged values **is** equal to the log of the median value. Therefore, 
`$$\color{blue}{\exp\{\text{Median}(\log(y))\} = \text{Median}(y)}$$`

- If the distribution of `$\log(y)$` is symmetric about the regression line, for a given value `$x_i$`,
`$$\color{blue}{\text{Median}(\log(y)) = \text{Mean}(\log(y))}$$`
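
These three facts can be checked numerically. The sketch below uses simulated right-skewed data (an assumption for illustration) whose logs are normally distributed, hence symmetric. The sample size is odd on purpose, so the sample median is an observed value and the back-transformed median matches exactly. With an even sample size the agreement is only approximate.

```r
# Simulated right-skewed data (hypothetical); log(y) is normal, hence symmetric
set.seed(210)
y <- exp(rnorm(1001, mean = 3, sd = 0.8))   # odd n: the sample median is an observed value

mean_backtransformed   <- exp(mean(log(y)))     # exp{Mean(log(y))}
median_backtransformed <- exp(median(log(y)))   # exp{Median(log(y))}

c(mean_backtransformed, mean(y))       # differ: the first is the geometric mean of y
c(median_backtransformed, median(y))   # agree
c(mean(log(y)), median(log(y)))        # close, because log(y) is symmetric
```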

---

### Interpretation with log-transformed `$y$`

- Given the previous facts, if `$\log(y) = \beta_0 + \beta_1 x$`, then

`$$\mathbf{\text{Median}(y) = \boldsymbol{\exp\{\beta_0\}\exp\{\beta_1 x\}}}$$`

<br>

- <font class="vocab">Intercept:</font> When `$x=0$`, the median of `$y$` is expected to be `$\exp\{\beta_0\}$`

- <font class="vocab">Slope: </font>For every one unit increase in `$x$`, the median of `$y$` is expected to multiply by a factor of `$\exp\{\beta_1\}$`
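
A minimal sketch of the slope interpretation, using simulated data (an assumption, not the respiratory data): the ratio of back-transformed predictions one unit apart in `$x$` equals `$\exp\{\hat{\beta}_1\}$`, whichever starting value of `$x$` is chosen.

```r
# Simulated data (hypothetical), generated so that log(y) is linear in x
set.seed(210)
x <- runif(200, 0, 10)
y <- exp(2 + 0.15 * x + rnorm(200, sd = 0.2))
fit <- lm(log(y) ~ x)

b <- coef(fit)
exp(b[["(Intercept)"]])         # estimated median of y when x = 0
slope_factor <- exp(b[["x"]])   # multiplicative change in the median per unit increase in x

# Predicted medians at x = 3 and x = 4, back-transformed from the log scale
pred_log <- predict(fit, newdata = data.frame(x = c(3, 4)))
median_ratio <- unname(exp(pred_log[2]) / exp(pred_log[1]))
c(slope_factor, median_ratio)   # the two agree
```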

---

## log(Rate) vs. Age

```r
respiratory <- respiratory %>% mutate(log_rate = log(Rate))
```

---

## log(Rate) vs. Age

```r
log_model <- lm(log_rate ~ Age, data = respiratory)
```

---

## log(Rate) vs. Age

```r
kable(tidy(log_model, conf.int=TRUE),format="html", digits=3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 3.845 </td>
   <td style="text-align:right;"> 0.013 </td>
   <td style="text-align:right;"> 304.500 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 3.82 </td>
   <td style="text-align:right;"> 3.870 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> -0.019 </td>
   <td style="text-align:right;"> 0.001 </td>
   <td style="text-align:right;"> -25.839 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -0.02 </td>
   <td style="text-align:right;"> -0.018 </td>
  </tr>
</tbody>
</table>
<br>

.question[
1. Write the model in terms of `$\log(rate)$`. 
2. Write the model in terms of `$rate$`. Interpret the slope and intercept.

]

---

## Confidence interval for `$\beta_j$`

- The confidence interval for the coefficient of `$x$` describing its relationship with `$\log(y)$` is

`$$\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)$$`

- The confidence interval for the coefficient of `$x$` describing its relationship with `$y$` is

`$$\exp\{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\}$$`
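
Exponentiating the endpoints of the log-scale interval gives the interval on the scale of `$y$`. A minimal sketch with simulated data (an assumption for illustration) builds the interval by hand and compares it with base R's `confint()`:

```r
# Simulated data (hypothetical) with a log-linear relationship
set.seed(210)
x <- runif(100, 0, 10)
y <- exp(1 + 0.1 * x + rnorm(100, sd = 0.25))
fit <- lm(log(y) ~ x)

est    <- coef(summary(fit))["x", "Estimate"]
se     <- coef(summary(fit))["x", "Std. Error"]
t_star <- qt(0.975, df = df.residual(fit))     # critical value for 95% confidence

ci_log_scale <- est + c(-1, 1) * t_star * se   # interval for beta_1 (relationship with log(y))
ci_y_scale   <- exp(ci_log_scale)              # interval for exp{beta_1} (relationship with y)

# confint() reproduces the log-scale interval
rbind(manual = ci_log_scale, confint = confint(fit)["x", ])
ci_y_scale
```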

---

## Coefficient of `Age`

```r
kable(tidy(log_model, conf.int=TRUE),format="html", digits=3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 3.845 </td>
   <td style="text-align:right;"> 0.013 </td>
   <td style="text-align:right;"> 304.500 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 3.82 </td>
   <td style="text-align:right;"> 3.870 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> -0.019 </td>
   <td style="text-align:right;"> 0.001 </td>
   <td style="text-align:right;"> -25.839 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -0.02 </td>
   <td style="text-align:right;"> -0.018 </td>
  </tr>
</tbody>
</table>
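
As a rough worked example, the `Age` row of the table can be moved to the scale of `Rate` by exponentiating the estimate and the interval endpoints. The inputs below are the rounded values printed in the table, so the results are approximate. The variable names are introduced here for illustration.

```r
# Values copied from the Age row of the table above (rounded to 3 digits)
age_est  <- -0.019
age_low  <- -0.020
age_high <- -0.018

factor_est <- exp(age_est)                # multiplicative change in median rate per month
factor_ci  <- exp(c(age_low, age_high))   # 95% CI on the multiplicative scale

round(factor_est, 3)   # 0.981
round(factor_ci, 3)    # 0.980 0.982

# The same estimate expressed as a percent change in the median rate per month
round(100 * (factor_est - 1), 1)   # -1.9
```

Read this way, each additional month of age is associated with the median respiratory rate being multiplied by about 0.981, a decrease of roughly 1.9%.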

---

## Log Transformation on `$x$`

.pull-left[
<img src="08-transformations_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="08-transformations_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

- Try a transformation on `$x$` if the scatterplot shows some curvature but the variance is constant for all values of `$x$`

---

## Model with Transformation on `$x$`

`$$y = \beta_0 + \beta_1 \log(x)$$`

- <font class="vocab">Intercept: </font> When `$\log(x)=0$` (i.e., `$x=1$`), `$y$` is expected to be `$\beta_0$` (i.e., the mean of `$y$` is `$\beta_0$`)

- <font class="vocab">Slope: </font> When `$x$` is multiplied by a factor of `$\mathbf{C}$`, `$y$` is expected to change by `$\boldsymbol{\beta_1}\mathbf{\log(C)}$` units, i.e. the mean of `$y$` changes by `$\boldsymbol{\beta_1}\mathbf{\log(C)}$`
    - *Example*: when `$x$` is multiplied by a factor of 2, `$y$` is expected to change by `$\boldsymbol{\beta_1}\mathbf{\log(2)}$` units
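
The doubling example can be checked with a short sketch. The data are simulated with arbitrary coefficients (an assumption for illustration), and `log()` in R is the natural log, matching the interpretation above.

```r
# Simulated data (hypothetical) where y is linear in log(x)
set.seed(210)
x <- runif(200, 1, 40)
y <- 10 + 3 * log(x) + rnorm(200, sd = 1)
fit <- lm(y ~ log(x))

b1 <- coef(fit)[["log(x)"]]
expected_change <- b1 * log(2)   # change in mean y when x is multiplied by 2

# Compare with the difference in predictions at x = 5 and x = 10
pred <- predict(fit, newdata = data.frame(x = c(5, 10)))
observed_change <- unname(pred[2] - pred[1])
c(expected_change, observed_change)   # the two agree
```

The same change would result from doubling any starting value of `$x$` (2 to 4, 10 to 20, and so on), which is what makes the "multiplied by a factor of `$C$`" phrasing convenient.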

---

## Rate vs. log(Age)

---

## Respiratory Rate vs. Age

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 50.134533 </td>
   <td style="text-align:right;"> 0.6319775 </td>
   <td style="text-align:right;"> 79.32961 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 48.893441 </td>
   <td style="text-align:right;"> 51.375625 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> log.age </td>
   <td style="text-align:right;"> -5.982434 </td>
   <td style="text-align:right;"> 0.2626097 </td>
   <td style="text-align:right;"> -22.78070 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -6.498153 </td>
   <td style="text-align:right;"> -5.466715 </td>
  </tr>
</tbody>
</table>
<br>

.question[
1. Write the equation for the model of `$y$` regressed on `$\log(x)$`.

2. Interpret the intercept in the context of the problem.

3. Interpret the slope in terms of how the mean respiratory rate changes when a child's age doubles.

4. Suppose a doctor has a patient who is currently 3 years old. Will this model provide a reliable prediction of the child's respiratory rate when her age doubles? Why or why not?
]

---

See [Log Transformations in Linear Regression](https://github.com/STA210-Sp19/supplemental-notes/blob/master/log-transformations.pdf) for more details about interpreting regression models with log-transformed variables.

---

## Model Assessment

---

## ANOVA table for regression

We can use the Analysis of Variance (ANOVA) table to decompose the variability in our response variable.

|  | Sum of Squares | DF | Mean Square | F-Stat| p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `$$\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$` | `$$p$$` | `$$\frac{MSS}{p}$$` | `$$\frac{MMS}{RMS}$$` | `$$P(F > \text{F-Stat})$$` |
| Residual | `$$\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$` | `$$n-p-1$$` | `$$\frac{RSS}{n-p-1}$$` |  |  |
| Total | `$$\sum\limits_{i=1}^{n}(y_i - \bar{y})^2$$` | `$$n-1$$` | `$$\frac{TSS}{n-1}$$` |  |  |

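Here `$MSS$`, `$RSS$`, and `$TSS$` are the model, residual, and total sums of squares; `$MMS = MSS/p$` and `$RMS = RSS/(n-p-1)$` are the model and residual mean squares; and `$p$` is the number of predictors.

The estimate of the regression variance is `$\hat{\sigma}^2 = RMS$`.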
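
Each entry of the table can be computed by hand. The sketch below uses R's built-in `mtcars` data purely for illustration (an assumption, not data from this lecture) and checks the results against `sigma()` and `summary()`:

```r
# Built-in mtcars data, used only for illustration
fit   <- lm(mpg ~ wt + hp, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(fit)
n <- length(y)
p <- length(coef(fit)) - 1        # number of predictors

MSS <- sum((y_hat - mean(y))^2)   # model (regression) sum of squares
RSS <- sum((y - y_hat)^2)         # residual sum of squares
TSS <- sum((y - mean(y))^2)       # total sum of squares

MMS <- MSS / p                    # model mean square
RMS <- RSS / (n - p - 1)          # residual mean square
F_stat  <- MMS / RMS
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

c(MSS + RSS, TSS)                 # the decomposition: MSS + RSS = TSS
c(RMS, sigma(fit)^2)              # RMS is the estimate of sigma^2
c(F_stat, summary(fit)$fstatistic[["value"]])
```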

---

## `$R^2$`

- **Recall**: `$R^2$` is the proportion of the variation in the response variable explained by the regression model
<br>

- `$R^2$` never decreases as we add more variables to the model, and in practice it almost always increases
  + If we add enough variables, we can always achieve `$R^2=100\%$`
<br>

- If we only use `$R^2$` to choose a best fit model, we will be prone to choose the model with the most predictor variables
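
A small simulation illustrates the point. In this sketch the response and all predictors are independent noise (an assumption for illustration), so no predictor is truly related to `$y$`, yet `$R^2$` keeps climbing as predictors are added.

```r
# Simulated data (hypothetical): the response is pure noise, unrelated to every predictor
set.seed(210)
n <- 20
dat <- data.frame(y = rnorm(n), matrix(rnorm(n * (n - 1)), nrow = n))

# R^2 for nested models using the first k noise predictors, k = 1, ..., n - 1
r_squared <- sapply(1:(n - 1), function(k) {
  fit <- lm(y ~ ., data = dat[, 1:(k + 1)])
  1 - sum(resid(fit)^2) / sum((dat$y - mean(dat$y))^2)
})

round(r_squared, 3)   # never decreases, and reaches 1 with n - 1 predictors
```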

---

## Adjusted `$R^2$`

- <font class="vocab">Adjusted `$R^2$`</font>: a version of `$R^2$` that penalizes for unnecessary predictor variables
<br>

- Similar to `$R^2$`, it measures the proportion of variation in the response that is explained by the regression model 
<br>

- Differs from `$R^2$` by using the mean squares rather than sums of squares and therefore adjusting for the number of predictor variables

---

## `$R^2$` and Adjusted `$R^2$`

`$$R^2 = \frac{\text{Total Sum of Squares} - \text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$`
<br>

.alert[
`$$Adj. R^2 = \frac{\text{Total Mean Square} - \text{Residual Mean Square}}{\text{Total Mean Square}}$$`
]
<br>

- `$Adj. R^2$` can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!

- Use `$R^2$` when describing the relationship between the response and predictor variables
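
Both formulas can be computed directly from the sums of squares. The sketch below again uses the built-in `mtcars` data for illustration only (an assumption, not data from this lecture) and compares the hand calculations with the values reported by `summary()`:

```r
# Built-in mtcars data, used only for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
n <- length(y)
p <- length(coef(fit)) - 1          # number of predictors

RSS <- sum(resid(fit)^2)            # residual sum of squares
TSS <- sum((y - mean(y))^2)         # total sum of squares

r2 <- (TSS - RSS) / TSS

total_ms    <- TSS / (n - 1)        # total mean square
residual_ms <- RSS / (n - p - 1)    # residual mean square
adj_r2 <- (total_ms - residual_ms) / total_ms

c(r2, summary(fit)$r.squared)
c(adj_r2, summary(fit)$adj.r.squared)
```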