class: center, middle, inverse, title-slide # Multiple Linear Regression ## Special Predictors ### Dr. Maria Tackett ### 02.06.19 --- ## Announcements - Lab 03 due today - HW 02 due Mon, Feb 11 - Policy on [regrade requests](https://www2.stat.duke.edu/courses/Spring19/sta210.001/policies.html) --- ## R packages ```r library(tidyverse) library(knitr) library(broom) library(Sleuth3) # case 1202 dataset library(cowplot) # use plot_grid function ``` --- ## Starting wages data **Explanatory** - <font class="vocab">`Educ`: </font>years of Education - <font class="vocab">`Exper`: </font>months of previous work Experience (before hire at bank) - <font class="vocab">`Female`: </font>1 if female, 0 if male - <font class="vocab">`Senior`: </font>months worked at bank since hire - <font class="vocab">`Age`: </font>Age in months **Response** - <font class="vocab">`Bsal`: </font>annual salary at time of hire --- ## Starting wages ```r wages <- case1202 %>% mutate(Female = ifelse(Sex=="Female",1,0)) %>% select(-Sal77,-Sex) ``` ```r glimpse(wages) ``` ``` ## Observations: 93 ## Variables: 6 ## $ Bsal <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690… ## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9… ## $ Age <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33… ## $ Educ <int> 15, 15, 15, 12, 12, 15, 16, 12, 12, 15, 12, 15, 15, 15, 1… ## $ Exper <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13… ## $ Female <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, … ``` --- ## Regression model ```r model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, data=wages) kable(tidy(model,conf.int=TRUE),format="html",digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low 
</th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 4981.434 </td> <td style="text-align:right;"> 7574.353 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -33.108 </td> <td style="text-align:right;"> -12.056 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> <td style="text-align:right;"> -0.801 </td> <td style="text-align:right;"> 2.063 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 42.887 </td> <td style="text-align:right;"> 141.725 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> <td style="text-align:right;"> -1.597 </td> <td style="text-align:right;"> 2.598 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1024.255 </td> <td style="text-align:right;"> -511.571 </td> </tr> </tbody> 
</table>

---

class: middle, center

## Math Details

---

## Regression Model

- The multiple linear regression model assumes
`$$\color{blue}{y|x_1, x_2, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p, \sigma^2)}$$`
<br>

--

- For a given observation `\((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\)`, we can rewrite the previous statement as
`$$\color{blue}{y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)}$$`

---

## Estimating `\(\sigma^2\)`

- For a given observation `\((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\)`, the residual is
`$$\color{blue}{e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})}$$`

--

- The estimated value of the regression variance, `\(\sigma^2\)`, is
`$$\color{blue}{\hat{\sigma}^2 = \frac{RSS}{n-p-1} = \frac{\sum_{i=1}^n e_i^2}{n-p-1}}$$`

---

## Estimating Coefficients

- One way to estimate the coefficients is by taking partial derivatives of the formula
`$$\color{blue}{\sum_{i=1}^n e_i^2 = \sum_{i=1}^{n}[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip})]^2}$$`

--

- This produces messy formulas, but we can represent the regression equation using matrix notation

---

## Matrix Notation

- `\(\mathbf{Y}\)`: `\(n \times 1\)` vector of responses
- `\(\mathbf{X}\)`: `\(n \times (p+1)\)` matrix of intercept and predictor variables
- `\(\boldsymbol{\beta}\)`: `\((p+1) \times 1\)` vector of coefficients

`$$\mathbf{Y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots\\ y_{n} \end{bmatrix} \hspace{15mm} \mathbf{X} = \begin{bmatrix} 1& x_{11} & x_{12} & \dots & x_{1p}\\ 1&x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1&x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix}\hspace{15mm}\boldsymbol{\beta}=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_p \end{bmatrix}$$`

<br>

--

.alert[
`$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$`
]

---

## Estimate `\(\beta_j\)`'s

.alert[
`$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$`
]

```r
# Estimate coefficients
y <- wages %>% select(Bsal) %>% as.matrix()
predictors <- wages %>% select(Senior, Age, Educ, Exper, Female)
intercept <- data.frame(intercept = rep(1, nrow(wages)))
x <- bind_cols(intercept, predictors) %>% as.matrix()
beta <- solve(t(x) %*% x) %*% t(x) %*% y
beta
```

```
##                   Bsal
## intercept 6277.8933861
## Senior     -22.5823029
## Age          0.6309603
## Educ        92.3060229
## Exper        0.5006397
## Female    -767.9126888
```

---

## Estimate `\(\sigma^2\)`

.alert[
`$$\hat{\sigma}^2 ~ = ~ \frac{\mathbf{e}^{T}\mathbf{e}}{n-p-1} ~ = ~ \frac{(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{T}(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n-p-1}$$`
]

```r
# Estimate sigma.sq
e <- y - x %*% beta # residuals
sigma.sq <- (t(e) %*% e) / (nrow(wages) - ncol(x))
(sigma.sq <- as.numeric(sigma.sq))
```

```
## [1] 258156
```

---

## Calculate `\(SE(\hat{\beta}_j)\)`

.alert[
`$$Var(\boldsymbol{\hat{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\hat{\sigma}^2$$`
]

Take the square root of the diagonal elements to get `\(SE(\hat{\beta}_j)\)`

```r
var.beta <- solve(t(x) %*% x) * sigma.sq
SE <- sqrt(diag(var.beta))
SE
```

```
##   intercept      Senior         Age        Educ       Exper      Female 
## 652.2713190   5.2957316   0.7206541  24.8635404   1.0552624 128.9700022
```

---

class: middle

See [Matrix Form of linear regression notes](https://github.com/STA210-Sp19/supplemental-notes/blob/master/regression-basics-matrix.pdf) for more details about the matrix form of linear regression.
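---

## Checking the matrix formulas

As a sanity check, the closed-form estimates above should agree with `lm()`. The sketch below re-derives `beta` and `SE` so the chunk runs on its own (only the data prep from earlier slides and base R are used; the comparison itself is new):

```r
library(dplyr)
library(Sleuth3)

# Rebuild the wages data as on the earlier slide
wages <- case1202 %>%
  mutate(Female = ifelse(Sex == "Female", 1, 0)) %>%
  select(-Sal77, -Sex)

# Design matrix with an intercept column, and the response
x <- cbind(intercept = 1,
           as.matrix(wages[, c("Senior", "Age", "Educ", "Exper", "Female")]))
y <- wages$Bsal

beta <- solve(t(x) %*% x) %*% t(x) %*% y                 # (X'X)^{-1} X'Y
e <- y - x %*% beta                                      # residuals
sigma.sq <- as.numeric(t(e) %*% e) / (nrow(x) - ncol(x)) # RSS / (n - p - 1)
SE <- sqrt(diag(solve(t(x) %*% x) * sigma.sq))           # SE of each coefficient

# Compare with lm() (unname() because the intercept labels differ)
model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, data = wages)
all.equal(unname(beta[, 1]), unname(coef(model)))
all.equal(unname(SE), unname(summary(model)$coefficients[, "Std. Error"]))
```

Both comparisons should return `TRUE`.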
--- class: middle, center ## Special Predictors --- ## Interpreting the Intercept <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> .question[ - Interpret the intercept. - Is this interpretation meaningful? Why or why not? 
]

---

### Mean-Centered Variables

- To aid with the interpretation of the intercept, you can subtract the mean value from each continuous predictor variable (i.e. center it)
- Use these <font class="vocab3">mean-centered</font> variables in the regression model
- Now the intercept is interpreted as the expected (i.e. mean) value of the response at the average value of the predictor variables

---

### Salary: Mean-Centered Variables

```r
wages <- wages %>%
  mutate(SeniorCent = Senior - mean(Senior),
         AgeCent = Age - mean(Age),
         EducCent = Educ - mean(Educ),
         ExperCent = Exper - mean(Exper))
```

.pull-left[
<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

### Salary: Mean-Centered Variables

.question[
Calculate the regression model using the mean-centered variables. How did the model change?
]

---

## Quadratic Terms

- Sometimes the response variable may have a quadratic relationship with one or more predictor variables
- You can see this in a plot of the residuals vs. a predictor variable
- Include quadratic terms in the model to capture the relationship
- <font class="vocab">Good Practice:</font> Also include all lower-order terms; this helps with interpretation
- You can show quadratic relationships by plotting the predicted mean response for different values of the explanatory variable
- The same applies for higher-order polynomial relationships

---

Below are plots of the residuals versus each quantitative predictor variable.

<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

.question[
Which variables (if any) appear to have a quadratic relationship with `Bsal`?
]

---

## Indicator (dummy) variables

- Suppose there is a categorical variable with `\(k\)` levels (categories)
- Make `\(k\)` indicator (dummy) variables
- Use `\(k-1\)` of the indicator variables in the model
- We can't uniquely estimate all `\(k\)` coefficients at once if the intercept is in the model
- The level that doesn't have a variable in the model is called the <font class="vocab3">baseline</font>
- Each coefficient is interpreted as the change in the mean response relative to the baseline

---

## Indicator variables when `\(k=2\)`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 5924.007 </td>
   <td style="text-align:right;"> 99.659 </td>
   <td style="text-align:right;"> 59.443 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 5725.925 </td>
   <td style="text-align:right;"> 6122.090 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> -767.913 </td>
   <td style="text-align:right;"> 128.970 </td>
   <td style="text-align:right;"> -5.954 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1024.255 </td>
   <td style="text-align:right;"> -511.571 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SeniorCent </td>
   <td style="text-align:right;"> -22.582 </td>
   <td style="text-align:right;"> 5.296 </td>
   <td style="text-align:right;"> -4.264 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -33.108 </td>
   <td style="text-align:right;"> -12.056 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeCent </td>
   <td style="text-align:right;"> 0.631 </td>
   <td style="text-align:right;"> 0.721 </td>
   <td style="text-align:right;"> 0.876 </td>
   <td style="text-align:right;"> 0.384 </td>
   <td style="text-align:right;"> -0.801 </td>
   <td style="text-align:right;"> 2.063 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> EducCent </td>
   <td style="text-align:right;"> 92.306 </td>
   <td style="text-align:right;"> 24.864 </td>
   <td style="text-align:right;"> 3.713 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 42.887 </td>
   <td style="text-align:right;"> 141.725 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ExperCent </td>
   <td style="text-align:right;"> 0.501 </td>
   <td style="text-align:right;"> 1.055 </td>
   <td style="text-align:right;"> 0.474 </td>
   <td style="text-align:right;"> 0.636 </td>
   <td style="text-align:right;"> -1.597 </td>
   <td style="text-align:right;"> 2.598 </td>
  </tr>
</tbody>
</table>

.question[
- What is the intercept of the model for males?
- What is the intercept of the model for females?
]

---

## Indicator variables when `\(k > 2\)`

Build a regression model with Education treated as a categorical variable.

.question[
- What is the baseline for Education?
- Interpret the coefficient for `EducCat16`.
- What is your conclusion from the p-value of `EducCat12`?
- What is your conclusion from the p-value of `EducCat15`?
]

---

## Interaction Terms

- **Case**: The relationship of a predictor variable with the response depends on the value of another predictor variable
    + This is an <font class="vocab3">interaction effect</font>
- Create a new interaction variable that is the product of the two interacting predictor variables
- <font class="vocab">Good Practice:</font> When including an interaction term, also include each variable on its own (called a **main effect**)

---

## Interaction effects

<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

.question[
Do you think there is a significant interaction effect that includes gender and seniority? Why or why not?
] --- ## Before next class - [Reading 05](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-05.html)
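---

## Appendix: code sketch for special predictors

One way the indicator-variable and interaction models discussed above could be fit. This is a sketch, not the code used in class: the names `EducCat`, `educ_model`, and `int_model` are assumptions, and `EducCat` is created here with `factor()`.

```r
library(dplyr)
library(Sleuth3)

wages <- case1202 %>%
  mutate(Female = ifelse(Sex == "Female", 1, 0),
         EducCat = factor(Educ)) %>%  # Educ treated as categorical
  select(-Sal77, -Sex)

# lm() creates the k - 1 indicator variables automatically;
# the lowest education level becomes the baseline
educ_model <- lm(Bsal ~ EducCat, data = wages)

# Interaction: Senior * Female expands to both main effects
# plus the product term Senior:Female
int_model <- lm(Bsal ~ Senior * Female, data = wages)
summary(int_model)
```

With the intercept in the model, `educ_model` has one coefficient for the baseline plus `\(k-1\)` indicator coefficients, and `int_model` contains the `Senior:Female` interaction term alongside the two main effects.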