class: center, middle, inverse, title-slide # Multiple Linear Regression ### Dr. Maria Tackett ### 02.04.19 --- ## Announcements - Reading 04 due Wed, Feb 13 - Lab 02 due Wed, Feb 13 - HW 02 due Mon, Feb 11 (posted after class) --- ## R Packages used in the notes ```r library(tidyverse) library(knitr) library(broom) library(Sleuth3) # case 1202 dataset library(cowplot) # use plot_grid function ``` --- ## Example: Starting Wages - In the 1970s Harris Trust and Savings Bank was sued for discrimination on the basis of gender. - The defense presented an analysis of the salaries for skilled, entry-level clerical employees as evidence. - **Question**: Did female employees receive lower starting salaries on average than male employees with similar experience and qualifications? --- ## Data ```r glimpse(wages) ``` ``` ## Observations: 93 ## Variables: 6 ## $ Bsal <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690… ## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9… ## $ Age <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33… ## $ Educ <int> 15, 15, 15, 12, 12, 15, 16, 12, 12, 15, 12, 15, 15, 15, 1… ## $ Exper <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13… ## $ Female <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, … ``` --- ## Variables **Explanatory** - <font class="vocab">`Educ`: </font>years of education - <font class="vocab">`Exper`: </font>months of previous work experience (before hire at bank) - <font class="vocab">`Female`: </font>1 if female, 0 if male - <font class="vocab">`Senior`: </font>months worked at bank since hire - <font class="vocab">`Age`: </font>age in months **Response** - <font class="vocab">`Bsal`: </font>annual salary at time of hire --- ## Salary comparison - <font class="vocab">Question: </font> Did female employees receive lower starting salaries on average than male employees with similar experience and qualifications? ```r ggplot(data=wages,aes(x=Bsal,fill=Female)) + geom_density(alpha=0.5) + labs(y="Starting Salary", title ="Starting Salary by Gender") ``` <img src="06-mlr_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ### Using ANOVA `$$\begin{aligned}&H_0: \mu_F = \mu_M\\ &H_a: \mu_F \neq \mu_M\end{aligned}$$` |term | df| sumsq| meansq| statistic| p.value| |:---------|--:|--------:|----------:|---------:|-------:| |Female | 1| 14045183| 14045183.2| 39.597| 0| |Residuals | 91| 32278107| 354704.5| NA| NA| .question[ - What's your conclusion? - What is a disadvantage to using this method to answer the question? ] --- ## Salary vs. Other Variables ```r pairs(Bsal ~ Senior + Age + Educ + Exper, data=wages) ``` <img src="06-mlr_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Multiple Regression Model - We will calculate a multiple linear regression model with the following form: <br> .small[ `$$\color{blue}{Bsal = \beta_0 + \beta_1 \text{Senior} + \beta_2 \text{Age} + \beta_3 \text{Educ} + \beta_4 \text{Exper} + \beta_5 \text{Female}}$$` ] <br> - Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values `Bsal` follow a Normal distribution --- ## Regression Model - Recall: The simple linear regression model assumes `$$y|x\sim N(\beta_0 + \beta_1 x, \sigma^2)$$` -- - Similarly: The multiple linear regression model assumes `$$\color{blue}{y|x_1, x_2, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p, \sigma^2)}$$` -- - For a given observation `\((x_{i1}, x_{i2} \ldots, x_{iP}, y_i)\)`, we can rewrite the previous statement as `$$\color{blue}{y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{5mm} \epsilon_i \sim N(0,\sigma^2)}$$` --- ## Regression Model - At any combination of `\(x's\)`, the true mean value of `\(y\)` is `$$\color{blue}{y = \beta_0 + \beta_1 x_{1} + \beta_2 x_2 + \dots + \beta_p x_p}$$` -- - We will use multiple linear regression to estimate the mean `$$\color{blue}{\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_{1} + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_{p}}$$` --- ## Regression Output ```r model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, data=wages) kable(tidy(model),format="html",digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> </tr> <tr> <td style="text-align:left;"> Female1 </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> --- ## Interpreting `\(\hat{\beta}_j\)` - An estimated coefficient `\(\hat{\beta}_j\)` is the amount `\(y\)` is expected to change when `\(x_j\)` increases by one unit **holding the values all other predictor variables constant** -- - *Example:* The estimated coefficient for `Educ` is 92.31. This means for each additional year of education an employee has, we expect starting salary to increase by about $92.31, holding all other predictor variables constant. --- ### Hypothesis Tests for `\(\hat{\beta}_j\)` - We want to test whether a particular coefficient has a value of 0 in the population, given all other variables in the model: `$$\begin{aligned}&H_0: \beta_j = 0 \\ &H_a: \beta_j \neq 0\end{aligned}$$` - The test statistic reported in R is the following: `$$\color{blue}{\text{test statistic} = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}}$$` --- ### Hypothesis Tests for `\(\hat{\beta}_j\)` - Suppose we want to test the following hypothesis `$$H_0: \beta_j = \beta_{j0}$$` - The test statistic will take the usual form: `$$\color{blue}{\text{test statistic} = \frac{\hat{\beta}_j - \beta_{j0}}{SE(\hat{\beta}_j)}}$$` - We calculate the **p-value:** using a `\(t\)` distribution with `\((n - p - 1)\)` degrees of freedom, where `\(p\)` is the number of predictor variables - Note: There are `\((p+1)\)` total terms in the model --- ## Salary <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> </tr> <tr> <td style="text-align:left;"> Female1 </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> .question[ Given the other variables in the model, are the following significant predictors of `Bsal`? - Education (`Educ`) - Experience (`Exper`) ] --- ## Salary <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> </tr> <tr> <td style="text-align:left;"> Female1 </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> .question[ Given the other variables in the model, which variable has a stronger linear relationship with `Bsal` - `Senior` or `Educ`? ] --- ### Confidence Interval for `\(\beta_j\)` - We calculate the confidence interval in the usual way `$$\text{estimate} \pm t^* \times SE$$` -- - Therefore, the confidence interval for `\(\beta_j\)` is `$$\color{blue}{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)}$$` - We calculate `\(t^*\)` using a `\(t\)` distribution with `\((n - p - 1)\)` degrees of freedom --- ### CI for `Educ` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 4981.434 </td> <td style="text-align:right;"> 7574.353 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -33.108 </td> <td style="text-align:right;"> -12.056 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> <td style="text-align:right;"> -0.801 </td> <td style="text-align:right;"> 2.063 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 42.887 </td> <td style="text-align:right;"> 141.725 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> <td style="text-align:right;"> -1.597 </td> <td style="text-align:right;"> 2.598 </td> </tr> <tr> <td style="text-align:left;"> Female1 </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1024.255 </td> <td style="text-align:right;"> -511.571 </td> </tr> </tbody> </table> .question[ Interpret the 95% confidence interval for the coefficient of `Educ`. ] --- ### Notes about CI and Hypothesis Tests - If the sample size is large enough, the test will probably reject `\(H_0: \beta_j=0\)` + Consider the <font class="vocab">practical significance</font> of the result not just the statistical significance -- - If the sample size is small, there may not be enough evidence to reject `\(H_0: \beta_j=0\)` - When you fail to reject the null hypothesis, **DON'T** immediately conclude that the variable has no association with the response. - There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association. --- ## Prediction - We calculate predictions the same as with simple linear regression - **Example:** Suppose we want to predict the starting wages for a female who is 28 years old with 12 years of education, 11 months seniority and 2 years of prior experience. `$$\begin{aligned}\hat{bsal} = & 6277.893 - 22.582 \times \text{Senior} + 0.631 \times \text{Age} \\ &+ 92.306 \times \text{Educ} + 0.501 \times \text{Exper} - 767.913 \times \text{Female}\end{aligned}$$` -- ```r 6277.893 - 22.582 * 11 + 0.631 * 28 + 92.306 * 12 + 0.501 * 24 - 767.913 * 1 ``` ``` ## [1] 6398.942 ``` --- ## Prediction - Just like with simple linear regression, we can use the <font class="vocab">`predict.lm()`</font> function in R to calculate the appropriate intervals for our predicted values -- - Suppose we want to predict the starting wages for a female who is 28 years old with 12 years of education, 11 months seniority and 2 years of prior experience. -- ```r x0 = data.frame(Senior= 11, Age = 28, Educ = 12, Exper = 24, Female = "1") predict.lm(model,x0,interval="prediction") ``` ``` ## fit lwr upr ## 1 6398.93 4967.054 7830.805 ``` --- ## Prediction Suppose we want to predict the mean age for the subset of all females who are 28 years old with 12 years of education, 11 months of seniority and 2 years of prior experience. .question[ - How will the predicted value change? - How will the interval change? ] -- ```r x0 = data.frame(Senior= 11, Age = 28, Educ = 12, Exper = 24, Female = "1") *predict.lm(model,x0,interval="confidence") ``` ``` ## fit lwr upr ## 1 6398.93 5383.844 7414.016 ``` --- ## Cautions - <font class="vocab3">Do not extrapolate!</font> Because there are multiple explanatory variables, you can extrapolation in many ways -- - The multiple regression model only shows <font class="vocab3"> association, not causality </font> + To prove causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study --- class: middle, center ## Assumptions --- ## Assumptions The confidence intervals and hypothesis tests are reliable only when the regression assumptions are reasonably satisfied 1. <font class="vocab">Linearity: </font> Response variable has a linear relationship with the explanatory variables in the model 2. <font class="vocab">Constant Variance: </font>The regression variance is the same for all set of predictor variables `\((x_1, \ldots, x_p)\)` 3. <font class="vocab">Normality: </font> For a given `\((x_1, \ldots, x_p)\)`, the distribution of `\(y\)` around its mean is Normal 4. <font class="vocab">Independence: </font>All observations are independent --- ## Scatterplots - Look at a scatterplot of the response variable vs. each of the predictor variables in the exploratory data analysis before calculating the regression model - This is a good way to check for obvious departures from linearity - Could be an indication that a higher order term or transformation is needed (will discuss this next class) --- ## Residual Plots - **Plot the residuals vs. the predicted values** - Can expose issues such at outliers or nonconstant variance - **Plot the residuals vs. each of the predictors** - Can expose issues between the response and a predictor variable that didn't show in the exploratory data analysis - Use boxplots to plot residuals versus categorical predictor variables -- - Residual plots should show no systematic pattern - Plot a histogram and QQ-plot of the residuals to check Normality --- ### Scatterplots ```r pairs(Bsal ~ Senior + Age + Educ + Exper, data=wages) ``` <img src="06-mlr_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ### Residuals vs. Predicted Values ```r wages <- wages %>% mutate(predicted = predict.lm(model), residuals = resid(model)) ggplot(data=wages,aes(x=predicted, y=residuals)) + geom_point() + geom_hline(yintercept=0,color="red") + labs(title="Residuals vs. Predicted Values") ``` <img src="06-mlr_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- ### Residuals vs. Predictors <img src="06-mlr_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ### Residuals vs. Predictors ```r ggplot(data=wages,aes(x=Female,y=residuals)) + geom_boxplot() + geom_hline(yintercept=0,color="red") + labs(x = "Female", y="Residuals") ``` <img src="06-mlr_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- ### Normality of Residuals .pull-left[ <img src="06-mlr_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="06-mlr_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] --- class: middle, center ## Math Foundation --- ## Regression Model - The multiple linear regression model assumes `$$\color{blue}{y|x_1, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p, \sigma^2)}$$` <br> -- - For a given observation `\((x_{i1}, \ldots, x_{ip}, y_i)\)`, we can rewrite the previous statement as `$$\color{blue}{Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)}$$` --- ## Estimating `\(\sigma^2\)` - For a given observation `\((x_{i1},\ldots,x_{ip}, y_i)\)` the residual is `$$e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip})$$` -- - The <font class="vocab">estimated regression variance</font> is .alert[ `$$\hat{\sigma}^2= \frac{RSS}{n-p-1} = \frac{\sum\limits_{i=1}^n[y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip})]^2}{n-p-1}$$` --- ## Calculating `\(\hat{\sigma}^2\)` Salary: Estimating `\(\hat{\sigma}^2\)` ```r sigma(model)^2 ``` ``` ## [1] 258156 ``` -- ```r kable(tidy(aov(model)),format="html",digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> sumsq </th> <th style="text-align:right;"> meansq </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3784914.70 </td> <td style="text-align:right;"> 3784914.70 </td> <td style="text-align:right;"> 14.661 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 17010.44 </td> <td style="text-align:right;"> 17010.44 </td> <td style="text-align:right;"> 0.066 </td> <td style="text-align:right;"> 0.798 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8814046.86 </td> <td style="text-align:right;"> 8814046.86 </td> <td style="text-align:right;"> 34.142 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2095479.05 </td> <td style="text-align:right;"> 2095479.05 </td> <td style="text-align:right;"> 8.117 </td> <td style="text-align:right;"> 0.005 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 9152264.30 </td> <td style="text-align:right;"> 9152264.30 </td> <td style="text-align:right;"> 35.452 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 87 </td> <td style="text-align:right;"> 22459574.96 </td> <td style="text-align:right;"> 258156.03 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> --- ## Before next class - [Reading 04](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-04.html)