class: center, middle, inverse, title-slide # Transformations & Model Assessment ### Dr. Maria Tackett ### 02.11.19 --- ## Announcements - HW 02 due today - Lab 04 due Wednesday - HW 03 due Monday, Feb 18 --- ## R packages ```r library(tidyverse) library(knitr) library(broom) library(cowplot) # use plot_grid function library(Sleuth3) #ex0824 data ``` --- ## Respiratory Rate vs. Age - A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a "high" rate, we first want to understand the relationship between a child's age and their respiratory rate. - The data contain the respiratory rate for 618 children ages 15 days to 3 years. - **Variables**: - <font class="vocab">`Age`</font>: age in months - <font class="vocab">`Rate`</font>: respiratory rate (breaths per minute) --- ## Rate vs. Age ```r respiratory <- ex0824 ggplot(data=respiratory, aes(x=Age, y=Rate)) + geom_point() + labs("Respiratory Rate vs. Age") ``` <img src="08-transformations_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- ## Rate vs. Age <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 47.052 </td> <td style="text-align:right;"> 0.504 </td> <td style="text-align:right;"> 93.317 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 46.062 </td> <td style="text-align:right;"> 48.042 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.696 </td> <td style="text-align:right;"> 0.029 </td> <td style="text-align:right;"> -23.684 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.753 </td> <td style="text-align:right;"> -0.638 </td> </tr> </tbody> </table> <img src="08-transformations_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- class: middle, center ## Log transformations --- ## Need to transform `\(y\)` - Typically, a "fan-shaped" residual plot indicates the need for a transformation of the response variable `\(y\)` + `\(\mathbf{\color{green}{\log(y)}}\)`: Easiest to interpret -- - When building a model: + Choose a transformation and build the model on the transformed data + Reassess the residual plots + If the residuals plots did not sufficiently improve, try a new transformation! --- ## Log transformation on `\(y\)` - Use when the residual plot shows "fan-shaped" pattern - If we apply a log transformation to the response variable, we want to estimate the parameters for the model... .alert[ `$$\log(y) = \beta_0 + \beta_1 x$$` ] -- - We want to interpret the model in terms of `\(y\)` not `\(\log(y)\)`, so we write all interpretations in terms of .alert[ `$$y = \exp\{\beta_0 + \beta_1 x\} = \exp\{\beta_0\}\exp\{\beta_1x\}$$` ] --- ### Mean and median of `\(\log(y)\)` - Recall that `\(y = \beta_0 + \beta_1 x_i\)` is the **mean** value of `\(y\)` at the given value `\(x_i\)`. This doesn't hold when we log-transform `\(y\)` -- - The mean of the logged values is **not** equal to the log of the mean value. Therefore at a given value of `\(x\)` `$$\color{blue}{\begin{aligned}\exp\{\text{Mean}(\log(y))\} \neq \text{Mean}(y) \\[5pt] \Rightarrow \exp\{\beta_0 + \beta_1 x\} \neq \text{Mean}(y) \end{aligned}}$$` -- - However, the median of the logged values **is** equal to the log of the median value. Therefore, `$$\color{blue}{\exp\{\text{Median}(\log(y))\} = \text{Median}(y)}$$` -- - If the distribution of `\(\log(y)\)` is symmetric about the regression line, for a given value `\(x_i\)`, `$$\color{blue}{\text{Median}(\log(y)) = \text{Mean}(\log(y))}$$` --- ### Interpretation with log-transformed `\(y\)` - Given the previous facts, if `\(\log(y) = \beta_0 + \beta_1 x\)`, then `$$\mathbf{\text{Median}(y) = \boldsymbol{\exp\{\beta_0\}\exp\{\beta_1 x\}}}$$` <br> -- - <font class="vocab">Intercept:</font> When `\(x=0\)`, the median of `\(y\)` is expected to be `\(\exp\{\beta_0\}\)` -- - <font class="vocab">Slope: </font>For every one unit increase in `\(x\)`, the median of `\(y\)` is expected to multiply by a factor of `\(\exp\{\beta_1\}\)` --- ## log(Rate) vs. Age ```r respiratory <- respiratory %>% mutate(log_rate = log(Rate)) ``` <img src="08-transformations_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## log(Rate) vs. Age ```r log_model <- lm(log_rate ~ Age, data = respiratory) ``` <img src="08-transformations_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## log(Rate) vs. Age ```r kable(tidy(log_model, conf.int=TRUE),format="html", digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.845 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 304.500 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3.82 </td> <td style="text-align:right;"> 3.870 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.019 </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> -25.839 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.02 </td> <td style="text-align:right;"> -0.018 </td> </tr> </tbody> </table> <br> .question[ 1. Write the model in terms of `\(\log(rate)\)`. 2. Write the model in terms of `\(rate\)`. Interpret the slope and intercept. ] --- ## Confidence interval for `\(\beta_j\)` - The confidence interval for the coefficient of `\(x\)` describing its relationship with `\(\log(y)\)` is `$$\hat{\beta}_j \pm t^* SE(\hat{\beta_j})$$` -- - The confidence interval for the coefficient of `\(x\)` describing its relationship with `\(y\)` is .alert[ `$$\exp\big\{\hat{\beta}_j \pm t^* SE(\hat{\beta_j})\big\}$$` ] --- ## Coefficient of `Age` ```r kable(tidy(log_model, conf.int=TRUE),format="html", digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.845 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 304.500 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3.82 </td> <td style="text-align:right;"> 3.870 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.019 </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> -25.839 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.02 </td> <td style="text-align:right;"> -0.018 </td> </tr> </tbody> </table> .question[ Interpret the 95% confidence interval for the coefficient of `Age` in terms of `Rate`. ] --- ## Log Transformation on `\(x\)` .pull-left[ <img src="08-transformations_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="08-transformations_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] - Try a transformation on `\(X\)` if the scatterplot shows some curvature but the variance is constant for all values of `\(X\)` --- ## Model with Transformation on `\(x\)` .alert[ `$$y = \beta_0 + \beta_1 \log(x)$$` ] <br> -- - <font class="vocab">Intercept: </font> When `\(\log(x)=0\)`, `\((x=1)\)`, `\(y\)` is expected to be `\(\beta_0\)` (i.e. the mean of `\(y\)` is `\(\beta_0\)`) -- - <font class="vocab">Slope: </font> When `\(x\)` is multiplied by a factor of `\(\mathbf{C}\)`, `\(y\)` is expected to change by `\(\boldsymbol{\beta_1}\mathbf{\log(C)}\)` units, i.e. the mean of `\(y\)` changes by `\(\boldsymbol{\beta_1}\mathbf{\log(C)}\)` - *Example*: when `\(x\)` is multiplied by a factor of 2, `\(y\)` is expected to change by `\(\boldsymbol{\beta_1}\mathbf{\log(2)}\)` units --- ## Rate vs. log(Age) <img src="08-transformations_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## Respiratory Rate vs. Age <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 50.134533 </td> <td style="text-align:right;"> 0.6319775 </td> <td style="text-align:right;"> 79.32961 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 48.893441 </td> <td style="text-align:right;"> 51.375625 </td> </tr> <tr> <td style="text-align:left;"> log.age </td> <td style="text-align:right;"> -5.982434 </td> <td style="text-align:right;"> 0.2626097 </td> <td style="text-align:right;"> -22.78070 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -6.498153 </td> <td style="text-align:right;"> -5.466715 </td> </tr> </tbody> </table> <br> .question[ 1. Write the equation for the model of `\(y\)` regressed on `\(\log(x)\)`. 2. Interpret the intercept in the context of the problem. 3. Interpret the slope in terms of how the mean respiratory rate changes when a child's age doubles. 4. Suppose a doctor has a patient who is currently 3 years old. Will this model provide a reliable prediction of the child's respiratory rate when her age doubles? Why or why not? ] --- class: middle See [Log Transformations in Linear Regression](https://github.com/STA210-Sp19/supplemental-notes/blob/master/log-transformations.pdf) for more details about interpreting regression models with log-transformed variables. --- class: middle, center ## Model Assessment --- ## ANOVA table for regression We can use the Analysis of Variance (ANOVA) table to decompose the variability in our response variable | | Sum of Squares | DF | Mean Square | F-Stat| p-value | |------------------|----------------|--------------------|-------------|-------------|--------------------| | Regression (Model) | `$$\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$` | `$$p$$` | `$$\frac{MSS}{p}$$` | `$$\frac{MMS}{RMS}$$` | `$$P(F > \text{F-Stat})$$` | | Residual | `$$\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$` | `$$n-p-1$$` | `$$\frac{RSS}{n-p-1}$$` | | | | Total | `$$\sum\limits_{i=1}^{n}(y_i - \bar{y})^2$$` | `$$n-1$$` | `$$\frac{TSS}{n-1}$$` | | | The estimate of the regression variance, `\(\hat{\sigma}^2 = RMS\)` --- ## `\(R^2\)` - **Recall**: `\(R^2\)` is the proportion of the variation in the response variable explained by the regression model <br> -- - `\(R^2\)` will always increase as we add more variables to the model + If we add enough variables, we can always achieve `\(R^2=100\%\)` <br> -- - If we only use `\(R^2\)` to choose a best fit model, we will be prone to choose the model with the most predictor variables --- ## Adjusted `\(R^2\)` - <font class="vocab">Adjusted `\(R^2\)`</font>: a version of `\(R^2\)` that penalizes for unnecessary predictor variables <br> - Similar to `\(R^2\)`, it measures the proportion of variation in the response that is explained by the regression model <br> - Differs from `\(R^2\)` by using the mean squares rather than sums of squares and therefore adjusting for the number of predictor variables --- ## `\(R^2\)` and Adjusted `\(R^2\)` `$$R^2 = \frac{\text{Total Sum of Squares} - \text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$` <br> -- .alert[ `$$Adj. R^2 = \frac{\text{Total Mean Square} - \text{Residual Mean Square}}{\text{Total Mean Square}}$$` ] <br> -- - `\(Adj. R^2\)` can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment! -- - Use `\(R^2\)` when describing the relationship between the response and predictor variables