class: center, middle, inverse, title-slide

# Multiple Linear Regression
## Interactions & Transformations
### Prof. Maria Tackett
### 02.17.20

---

class: middle, center

### [Click for PDF of slides](09-transformations.pdf)

---

### Announcements

- Team Feedback #1 **due Wed, Feb 19 at 11:59p**
    - Check for email from Teammates
    - Please provide honest and constructive feedback. This team feedback will be graded for completion.

- HW 03 **due Mon, Feb 24 at 11:59p**

---

### Today's Agenda

- Interactions

- Log Transformations

---

class: middle, center

## Interactions

---

### Interaction Terms

- **Case**: Relationship of the predictor variable with the response depends on the value of another predictor variable
    + This is an .vocab[interaction effect]

- Create a new interaction variable that is one predictor variable times the other in the interaction

- **Good Practice**: When including an interaction term, also *include the associated <u>main effects</u>* (each predictor variable on its own) even if their coefficients are not statistically significant

---

### Checking for interactions in the EDA

<img src="09-transformations_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

### The data

**Predictors**

- .vocab[`verified_income`]: Whether borrower's income source and amount have been verified (`Not Verified`, `Source Verified`, `Verified`)
- .vocab[`debt_to_income`]: Debt-to-income ratio, i.e. the percentage of a borrower's total debt divided by their total income
- .vocab[`bankruptcy`]: Indicator of whether borrower has had a bankruptcy in the past (`0`: No, `1`: Yes)
- .vocab[`term`]: Length of the loan in months
- .vocab[`credit_util`]: What fraction of total credit a borrower is utilizing, i.e.
total credit utilized divided by total credit limit

**Response**

- .vocab[`interest_rate`]: Interest rate for the loan

```
## Observations: 9,974
## Variables: 9
## $ verified_income  <chr> "Verified", "Not Verified", "Source Verified", "Not …
## $ debt_to_income   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19…
## $ bankruptcy       <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0…
## $ term             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, …
## $ credit_util      <dbl> 0.54759517, 0.15003472, 0.66134832, 0.19673228, 0.75…
## $ interest_rate    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99…
## $ debt_inc_cent    <dbl> -1.3019882, -14.2719882, 1.8380118, -9.1519882, 38.6…
## $ term_cent        <dbl> 16.725887, -7.274113, -7.274113, -7.274113, -7.27411…
## $ credit_util_cent <dbl> 0.14448914, -0.25307131, 0.25824229, -0.20637375, 0.…
```

---

### Add interaction term

```r
model_w_int <- lm(interest_rate ~ verified_income + debt_inc_cent +
                    bankruptcy + term_cent + credit_util_cent +
                    debt_inc_cent * verified_income,
                  data = loans)
```

|term                                         | estimate| std.error| statistic| p.value|
|:--------------------------------------------|--------:|---------:|---------:|-------:|
|(Intercept)                                  |   11.298|     0.074|   151.764|   0.000|
|verified_incomeSource Verified               |    1.094|     0.100|    10.940|   0.000|
|verified_incomeVerified                      |    2.704|     0.119|    22.730|   0.000|
|debt_inc_cent                                |    0.032|     0.005|     6.527|   0.000|
|bankruptcy1                                  |    0.525|     0.133|     3.954|   0.000|
|term_cent                                    |    0.154|     0.004|    38.764|   0.000|
|credit_util_cent                             |    4.841|     0.163|    29.689|   0.000|
|verified_incomeSource Verified:debt_inc_cent |   -0.009|     0.007|    -1.243|   0.214|
|verified_incomeVerified:debt_inc_cent        |   -0.019|     0.007|    -2.699|   0.007|

---

### Understanding interactions

- **Different intercept**: `verified_incomeVerified` = 2.704

- **Different slope**: `verified_incomeVerified:debt_inc_cent` = -0.019

---

class: middle, center

## Log Transformations

---

## Respiratory Rate vs. Age

- A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a "high" rate, we first want to understand the relationship between a child's age and their respiratory rate.

- The data contain the respiratory rate for 618 children ages 15 days to 3 years.

- **Variables**:
    - <font class="vocab">`Age`</font>: age in months
    - <font class="vocab">`Rate`</font>: respiratory rate (breaths per minute)

---

## Rate vs. Age

```r
library(ggplot2)

# ex0824: respiratory-rate data, assumed here to be loaded from the Sleuth3 package
respiratory <- ex0824

ggplot(data = respiratory, aes(x = Age, y = Rate)) +
  geom_point() +
  labs(title = "Respiratory Rate vs. Age")  # labs() needs a named argument
```

<img src="09-transformations_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Rate vs. Age

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 47.052 </td> <td style="text-align:right;"> 0.504 </td> <td style="text-align:right;"> 93.317 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 46.062 </td> <td style="text-align:right;"> 48.042 </td> </tr>
  <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.696 </td> <td style="text-align:right;"> 0.029 </td> <td style="text-align:right;"> -23.684 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.753 </td> <td style="text-align:right;"> -0.638 </td> </tr>
 </tbody>
</table>

<img src="09-transformations_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

class: middle, center

## Log transformations

---

### Need to transform `\(y\)`

- Typically, a "fan-shaped" residual plot indicates the
need for a transformation of the response variable `\(y\)`
    + `\(\mathbf{\color{green}{\log(y)}}\)`: Easiest to interpret

--

- When building a model:
    + Choose a transformation and build the model on the transformed data
    + Reassess the residual plots
    + If the residual plots did not sufficiently improve, try a new transformation!

---

### Log transformation on `\(y\)`

- Use when the residual plot shows a "fan-shaped" pattern

- If we apply a log transformation to the response variable, we want to estimate the parameters for the model...

.alert[
`$$\log(y) = \beta_0 + \beta_1 x$$`
]

--

- We want to interpret the model in terms of `\(y\)` not `\(\log(y)\)`, so we write all interpretations in terms of

.alert[
`$$y = \exp\{\beta_0 + \beta_1 x\} = \exp\{\beta_0\}\exp\{\beta_1x\}$$`
]

---

### Mean and logs

Suppose we have a set of values

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

--

Let's find the mean of the logged values of x, i.e. `\(\overline{\log(x)}\)`

```r
log_x <- log(x)
mean(log_x)
```

```
## [1] 2.066476
```

--

Let's find the mean of x and then log the mean value, i.e. `\(\log(\bar{x})\)`

```r
xbar <- mean(x)
log(xbar)
```

```
## [1] 2.228477
```

---

### Median and logs

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

--

Let's find the median of the logged values of x, i.e. `\(\text{Median}(\log(x))\)`

```r
log_x <- log(x)
median(log_x)
```

```
## [1] 2.079442
```

--

Let's find the median of x and then log the median value, i.e. `\(\log(\text{Median}(x))\)`

```r
median_x <- median(x)
log(median_x)
```

```
## [1] 2.079442
```

---

### Mean, Median, and log

```r
x <- c(3, 5, 6, 8, 10, 14, 19)
```

--

`$$\overline{\log(x)} \neq \log(\bar{x})$$`

```r
mean(log_x) == log(xbar)
```

```
## [1] FALSE
```

--

`$$\text{Median}(\log(x)) = \log(\text{Median}(x))$$`

```r
median(log_x) == log(median_x)
```

```
## [1] TRUE
```

---

### Mean and median of `\(\log(y)\)`

- Recall that in the model `\(y = \beta_0 + \beta_1 x\)`, the value `\(\beta_0 + \beta_1 x_i\)` is the **mean** value of `\(y\)` at the given value `\(x_i\)`.
This doesn't hold when we log-transform `\(y\)`

--

- The mean of the logged values is **not** equal to the log of the mean value. Therefore at a given value of `\(x\)`

.alert[
`$$\begin{aligned}\exp\{\text{Mean}(\log(y))\} \neq \text{Mean}(y) \\[5pt] \Rightarrow \exp\{\beta_0 + \beta_1 x\} \neq \text{Mean}(y) \end{aligned}$$`
]

---

### Mean and median of `\(\log(y)\)`

- However, the median of the logged values **is** equal to the log of the median value. Therefore,

.alert[
`$$\exp\{\text{Median}(\log(y))\} = \text{Median}(y)$$`
]

--

- If the distribution of `\(\log(y)\)` is symmetric about the regression line, for a given value `\(x_i\)`,

.alert[
`$$\text{Median}(\log(y)) = \text{Mean}(\log(y))$$`
]

---

### Interpretation with log-transformed `\(y\)`

- Given the previous facts, if `\(\log(y) = \beta_0 + \beta_1 x\)`, then

.alert[
`$$\text{Median}(y) = \exp\{\beta_0\}\exp\{\beta_1 x\}$$`
]

<br><br>

- <font class="vocab">Intercept:</font> When `\(x=0\)`, the median of `\(y\)` is expected to be `\(\exp\{\beta_0\}\)`

<br>

- <font class="vocab">Slope: </font>For every one unit increase in `\(x\)`, the median of `\(y\)` is expected to multiply by a factor of `\(\exp\{\beta_1\}\)`

---

### log(Rate) vs. Age

```r
respiratory <- respiratory %>%
  mutate(log_rate = log(Rate))
```

<img src="09-transformations_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

### log(Rate) vs. Age

```r
log_model <- lm(log_rate ~ Age, data = respiratory)
```

<img src="09-transformations_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

### log(Rate) vs. Age

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |    3.845|     0.013|   304.500|       0|     3.82|     3.870|
|Age         |   -0.019|     0.001|   -25.839|       0|    -0.02|    -0.018|

<br>

.question[
- Go to http://bit.ly/sta210-sp20-logy and interpret the model.
]

---

### Confidence interval for `\(\beta_j\)`

- The confidence interval for the coefficient of `\(x\)` describing its relationship with `\(\log(y)\)` is

`$$\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)$$`

--

- The confidence interval for the coefficient of `\(x\)` describing its relationship with `\(y\)` is

.alert[
`$$\exp\big\{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\big\}$$`
]

---

### Coefficient of `Age`

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.845 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 304.500 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3.82 </td> <td style="text-align:right;"> 3.870 </td> </tr>
  <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.019 </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> -25.839 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.02 </td> <td style="text-align:right;"> -0.018 </td> </tr>
 </tbody>
</table>

.question[
Interpret the 95% confidence interval for the coefficient of `Age` in terms of *rate*.
]

---

### Log Transformation on `\(x\)`

.pull-left[
<img src="09-transformations_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="09-transformations_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

- Try a transformation on `\(x\)` if the scatterplot shows some curvature but the variance is constant for all values of `\(x\)`

---

### Model with Transformation on `\(x\)`

.alert[
`$$y = \beta_0 + \beta_1 \log(x)$$`
]

<br>

--

- <font class="vocab">Intercept: </font> When `\(\log(x)=0\)` (i.e. `\(x=1\)`), `\(y\)` is expected to be `\(\beta_0\)` (i.e. the mean of `\(y\)` is `\(\beta_0\)`)

--

- <font class="vocab">Slope: </font> When `\(x\)` is multiplied by a factor of `\(\mathbf{C}\)`, `\(y\)` is expected to change by `\(\boldsymbol{\beta_1}\mathbf{\log(C)}\)` units, i.e. the mean of `\(y\)` changes by `\(\boldsymbol{\beta_1}\mathbf{\log(C)}\)`
    - *Example*: when `\(x\)` is multiplied by a factor of 2, `\(y\)` is expected to change by `\(\boldsymbol{\beta_1}\mathbf{\log(2)}\)` units

---

### Rate vs. log(Age)

<img src="09-transformations_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

### Rate vs. log(Age)

|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |   50.135|     0.632|    79.330|       0|   48.893|    51.376|
|log_age     |   -5.982|     0.263|   -22.781|       0|   -6.498|    -5.467|

<br>

.question[
Go to http://bit.ly/sta210-sp20-logx and interpret the model.
]

---

class: middle

See [Log Transformations in Linear Regression](https://github.com/sta210-sp20/supplemental-notes/blob/master/log-transformations.pdf) for more details about interpreting regression models with log-transformed variables.
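
---

### Sketch: back-transforming a log-`\(y\)` model

The interpretation and confidence-interval rules for a log-transformed response can be sketched in base R. This is a minimal illustration on simulated data, not the respiratory data from the slides; the object names (`sim`, `fit`, `mult_effect`, `mult_ci`) and the simulation settings are assumptions made for the example.

```r
# Minimal sketch on simulated data (not the respiratory data from the slides).
# All object names and simulation settings below are hypothetical.
set.seed(210)

# Simulate a response whose log is linear in x
n <- 200
x <- runif(n, min = 0, max = 36)
y <- exp(3.8 - 0.02 * x + rnorm(n, sd = 0.2))
sim <- data.frame(x = x, y = y)

# Fit the model on the log scale: log(y) = beta_0 + beta_1 * x
fit <- lm(log(y) ~ x, data = sim)

# Exponentiate the slope: the multiplicative change in the median of y
# for a one-unit increase in x
mult_effect <- exp(coef(fit)["x"])

# Exponentiate the endpoints of the 95% CI for the slope to get a CI
# for that multiplicative change
mult_ci <- exp(confint(fit, "x", level = 0.95))

mult_effect
mult_ci
```

Here `mult_effect` estimates the factor by which the median of `\(y\)` is expected to multiply for each one-unit increase in `\(x\)`, and `mult_ci` is the corresponding 95% confidence interval on that multiplicative scale.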