Multiple Linear Regression

# Multiple Linear Regression
### Dr. Maria Tackett
### 02.04.19

---

## Announcements

- Reading 04 due Wed, Feb 13

- Lab 02 due Wed, Feb 13

- HW 02 due Mon, Feb 11 (posted after class)

---

## R Packages used in the notes

```r
library(tidyverse)
library(knitr)
library(broom)
library(Sleuth3) # case 1202 dataset
library(cowplot) # use plot_grid function
```

---

## Example: Starting Wages

- In the 1970s Harris Trust and Savings Bank was sued for discrimination on the basis of gender.

- The defense presented an analysis of the salaries for skilled, entry-level clerical employees as evidence.

- **Question**: Did female employees receive lower starting salaries on average than male employees with similar experience and qualifications?

---

## Data

```r
glimpse(wages)
```

```
## Observations: 93
## Variables: 6
## $ Bsal   <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690…
## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9…
## $ Age    <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33…
## $ Educ   <int> 15, 15, 15, 12, 12, 15, 16, 12, 12, 15, 12, 15, 15, 15, 1…
## $ Exper  <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13…
## $ Female <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
```

---

## Variables

**Explanatory**
- <font class="vocab">`Educ`: </font>years of education
- <font class="vocab">`Exper`: </font>months of previous work experience (before hire at bank)
- <font class="vocab">`Female`: </font>1 if female, 0 if male
- <font class="vocab">`Senior`: </font>months worked at bank since hire
- <font class="vocab">`Age`: </font>age in months

**Response**
- <font class="vocab">`Bsal`: </font>annual salary at time of hire

---

## Salary comparison

- <font class="vocab">Question: </font> Did female employees receive lower starting salaries on average than male employees with similar experience and qualifications?

```r
ggplot(data=wages,aes(x=Bsal,fill=Female)) + geom_density(alpha=0.5) + 
  labs(y="Starting Salary", title ="Starting Salary by Gender") 
```

---

### Using ANOVA

`$$\begin{aligned}&H_0: \mu_F = \mu_M\\
&H_a: \mu_F \neq \mu_M\end{aligned}$$`

|term      | df|    sumsq|     meansq| statistic| p.value|
|:---------|--:|--------:|----------:|---------:|-------:|
|Female    |  1| 14045183| 14045183.2|    39.597|       0|
|Residuals | 91| 32278107|   354704.5|        NA|      NA|

- What is a disadvantage to using this method to answer the question?
]

---

## Salary vs. Other Variables

```r
pairs(Bsal ~ Senior + Age + Educ + Exper, data=wages)
```

---

## Multiple Regression Model

- We will calculate a multiple linear regression model with the following form: 
<br>

.small[
`$$\color{blue}{Bsal = 
\beta_0 + \beta_1 \text{Senior} + \beta_2 \text{Age} + \beta_3 \text{Educ} + \beta_4 \text{Exper} + \beta_5 \text{Female}}$$`
]
<br>

- Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values `Bsal` follow a Normal distribution

---

## Regression Model

- Recall: The simple linear regression model assumes

`$$y|x\sim N(\beta_0 + \beta_1 x, \sigma^2)$$`
--

- Similarly: The multiple linear regression model assumes

`$$\color{blue}{y|x_1, x_2, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p, \sigma^2)}$$`
--

- For a given observation `$(x_{i1}, x_{i2} \ldots, x_{iP}, y_i)$`, we can rewrite the previous statement as

`$$\color{blue}{y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{5mm} \epsilon_i \sim N(0,\sigma^2)}$$`

---

## Regression Model

- At any combination of `$x's$`, the true mean value of `$y$` is
`$$\color{blue}{y = \beta_0 + \beta_1 x_{1} + \beta_2 x_2 + \dots + \beta_p x_p}$$`
--

- We will use multiple linear regression to estimate the mean
`$$\color{blue}{\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_{1} + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_{p}}$$`

---

## Regression Output

```r
model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, 
            data=wages)
kable(tidy(model),format="html",digits=3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 6277.893 </td>
   <td style="text-align:right;"> 652.271 </td>
   <td style="text-align:right;"> 9.625 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Senior </td>
   <td style="text-align:right;"> -22.582 </td>
   <td style="text-align:right;"> 5.296 </td>
   <td style="text-align:right;"> -4.264 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> 0.631 </td>
   <td style="text-align:right;"> 0.721 </td>
   <td style="text-align:right;"> 0.876 </td>
   <td style="text-align:right;"> 0.384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Educ </td>
   <td style="text-align:right;"> 92.306 </td>
   <td style="text-align:right;"> 24.864 </td>
   <td style="text-align:right;"> 3.713 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Exper </td>
   <td style="text-align:right;"> 0.501 </td>
   <td style="text-align:right;"> 1.055 </td>
   <td style="text-align:right;"> 0.474 </td>
   <td style="text-align:right;"> 0.636 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female1 </td>
   <td style="text-align:right;"> -767.913 </td>
   <td style="text-align:right;"> 128.970 </td>
   <td style="text-align:right;"> -5.954 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
</tbody>
</table>

---

## Interpreting `$\hat{\beta}_j$`

- An estimated coefficient `$\hat{\beta}_j$` is the amount `$y$` is expected to change when `$x_j$` increases by one unit **holding the values all other predictor variables constant**

- *Example:* The estimated coefficient for `Educ` is 92.31. This means for each additional year of education an employee has, we expect starting salary to increase by about $92.31, holding all other predictor variables constant.

---

### Hypothesis Tests for `$\hat{\beta}_j$`

- We want to test whether a particular coefficient has a value of 0 in the population, given all other variables in the model:

`$$\begin{aligned}&H_0: \beta_j = 0 \\ &H_a: \beta_j \neq 0\end{aligned}$$`

- The test statistic reported in R is the following:

`$$\color{blue}{\text{test statistic} = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}}$$`

---

### Hypothesis Tests for `$\hat{\beta}_j$`

- Suppose we want to test the following hypothesis 
`$$H_0: \beta_j = \beta_{j0}$$`

- The test statistic will take the usual form:

`$$\color{blue}{\text{test statistic} = \frac{\hat{\beta}_j - \beta_{j0}}{SE(\hat{\beta}_j)}}$$`

- We calculate the **p-value:** using a `$t$` distribution with `$(n - p - 1)$` degrees of freedom, where `$p$` is the number of predictor variables
    - Note: There are `$(p+1)$` total terms in the model 
  
---

## Salary

.question[
Given the other variables in the model, are the following significant predictors of `Bsal`?

- Education (`Educ`)
- Experience (`Exper`)
]

---

## Salary

.question[
Given the other variables in the model, which variable has a stronger linear relationship with `Bsal` - `Senior` or `Educ`?
]

---

### Confidence Interval for `$\beta_j$`

- We calculate the confidence interval in the usual way

`$$\text{estimate} \pm t^* \times SE$$`
--

- Therefore, the confidence interval for `$\beta_j$` is 
`$$\color{blue}{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)}$$`

- We calculate `$t^*$` using a `$t$` distribution with `$(n - p - 1)$` degrees of freedom

---

### CI for `Educ`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 6277.893 </td>
   <td style="text-align:right;"> 652.271 </td>
   <td style="text-align:right;"> 9.625 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 4981.434 </td>
   <td style="text-align:right;"> 7574.353 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Senior </td>
   <td style="text-align:right;"> -22.582 </td>
   <td style="text-align:right;"> 5.296 </td>
   <td style="text-align:right;"> -4.264 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -33.108 </td>
   <td style="text-align:right;"> -12.056 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> 0.631 </td>
   <td style="text-align:right;"> 0.721 </td>
   <td style="text-align:right;"> 0.876 </td>
   <td style="text-align:right;"> 0.384 </td>
   <td style="text-align:right;"> -0.801 </td>
   <td style="text-align:right;"> 2.063 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Educ </td>
   <td style="text-align:right;"> 92.306 </td>
   <td style="text-align:right;"> 24.864 </td>
   <td style="text-align:right;"> 3.713 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 42.887 </td>
   <td style="text-align:right;"> 141.725 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Exper </td>
   <td style="text-align:right;"> 0.501 </td>
   <td style="text-align:right;"> 1.055 </td>
   <td style="text-align:right;"> 0.474 </td>
   <td style="text-align:right;"> 0.636 </td>
   <td style="text-align:right;"> -1.597 </td>
   <td style="text-align:right;"> 2.598 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female1 </td>
   <td style="text-align:right;"> -767.913 </td>
   <td style="text-align:right;"> 128.970 </td>
   <td style="text-align:right;"> -5.954 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1024.255 </td>
   <td style="text-align:right;"> -511.571 </td>
  </tr>
</tbody>
</table>

### Notes about CI and Hypothesis Tests

- If the sample size is large enough, the test will probably reject `$H_0: \beta_j=0$`
  + Consider the <font class="vocab">practical significance</font> of the result not just the statistical significance 
 
  
--

- If the sample size is small, there may not be enough evidence to reject `$H_0: \beta_j=0$`
    - When you fail to reject the null hypothesis, **DON'T** immediately conclude that the variable has no association with the response. 
    - There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

---

## Prediction

- We calculate predictions the same as with simple linear regression

- **Example:** Suppose we want to predict the starting wages for a female who is 28 years old with 12 years of education, 11 months seniority and 2 years of prior experience.

`$$\begin{aligned}\hat{bsal} = & 6277.893 - 22.582 \times \text{Senior} + 0.631 \times \text{Age} \\ &+ 92.306 \times \text{Educ} + 0.501 \times \text{Exper} - 767.913 \times \text{Female}\end{aligned}$$`

```r
6277.893 - 22.582 * 11 + 0.631 * 28 + 92.306 * 12 + 0.501 * 24 - 767.913 * 1
```

```
## [1] 6398.942
```

---

## Prediction

- Just like with simple linear regression, we can use the <font class="vocab">`predict.lm()`</font> function in R to calculate the appropriate intervals for our predicted values

- Suppose we want to predict the starting wages for a female who is 28 years old with 12 years of education, 11 months seniority and 2 years of prior experience.

```r
x0 = data.frame(Senior= 11, Age = 28, Educ = 12, Exper = 24, Female = "1")
predict.lm(model,x0,interval="prediction")
```

```
##       fit      lwr      upr
## 1 6398.93 4967.054 7830.805
```

---

## Prediction

Suppose we want to predict the mean age for the subset of all females who are 28 years old with 12 years of education, 11 months of seniority and 2 years of prior experience.

- How will the interval change? 
]

```r
x0 = data.frame(Senior= 11, Age = 28, Educ = 12, Exper = 24, Female = "1")
*predict.lm(model,x0,interval="confidence")
```

```
##       fit      lwr      upr
## 1 6398.93 5383.844 7414.016
```

---

## Cautions

- <font class="vocab3">Do not extrapolate!</font> Because there are multiple explanatory variables, you can extrapolation in many ways

- The multiple regression model only shows <font class="vocab3"> association, not causality </font>
  + To prove causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

---

## Assumptions

---

## Assumptions

The confidence intervals and hypothesis tests are reliable only when the regression assumptions are reasonably satisfied

1. <font class="vocab">Linearity: </font> Response variable has a linear relationship with the explanatory variables in the model

2. <font class="vocab">Constant Variance: </font>The regression variance is the same for all set of predictor variables `$(x_1, \ldots, x_p)$`

3. <font class="vocab">Normality: </font> For a given `$(x_1, \ldots, x_p)$`, the distribution of `$y$` around its mean is Normal

4. <font class="vocab">Independence: </font>All observations are independent

---

## Scatterplots

- Look at a scatterplot of the response variable vs. each of the predictor variables in the exploratory data analysis before calculating the regression model

- This is a good way to check for obvious departures from linearity 
    - Could be an indication that a higher order term or transformation is needed (will discuss this next class)

---

## Residual Plots

- **Plot the residuals vs. the predicted values**
    - Can expose issues such at outliers or nonconstant variance

- **Plot the residuals vs. each of the predictors**
    - Can expose issues between the response and a predictor variable that didn't show in the exploratory data analysis
    - Use boxplots to plot residuals versus categorical predictor variables

- Residual plots should show no systematic pattern

- Plot a histogram and QQ-plot of the residuals to check Normality

---

### Scatterplots

```r
pairs(Bsal ~ Senior + Age + Educ + Exper, data=wages)
```

---

### Residuals vs. Predicted Values

```r
wages <- wages %>% 
  mutate(predicted = predict.lm(model), residuals = resid(model))
ggplot(data=wages,aes(x=predicted, y=residuals)) + 
  geom_point() + 
  geom_hline(yintercept=0,color="red") +
  labs(title="Residuals vs. Predicted Values") 
```

---

### Residuals vs. Predictors

---

### Residuals vs. Predictors

```r
ggplot(data=wages,aes(x=Female,y=residuals)) + 
  geom_boxplot() + 
  geom_hline(yintercept=0,color="red") +
  labs(x = "Female", 
       y="Residuals")
```

---

### Normality of Residuals

.pull-left[
<img src="06-mlr_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="06-mlr_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

---

## Math  Foundation

---

## Regression Model

- The multiple linear regression model assumes

`$$\color{blue}{y|x_1, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p, \sigma^2)}$$`
<br>

- For a given observation `$(x_{i1}, \ldots, x_{ip}, y_i)$`, we can rewrite the previous statement as

`$$\color{blue}{Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)}$$`

---

## Estimating `$\sigma^2$`

- For a given observation `$(x_{i1},\ldots,x_{ip}, y_i)$` the residual is 
`$$e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip})$$`

- The <font class="vocab">estimated regression variance</font> is

.alert[
`$$\hat{\sigma}^2= \frac{RSS}{n-p-1} = \frac{\sum\limits_{i=1}^n[y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip})]^2}{n-p-1}$$`

---

## Calculating `$\hat{\sigma}^2$`

Salary: Estimating `$\hat{\sigma}^2$`

```r
sigma(model)^2
```

```
## [1] 258156
```

```r
kable(tidy(aov(model)),format="html",digits=3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> df </th>
   <th style="text-align:right;"> sumsq </th>
   <th style="text-align:right;"> meansq </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Senior </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 3784914.70 </td>
   <td style="text-align:right;"> 3784914.70 </td>
   <td style="text-align:right;"> 14.661 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 17010.44 </td>
   <td style="text-align:right;"> 17010.44 </td>
   <td style="text-align:right;"> 0.066 </td>
   <td style="text-align:right;"> 0.798 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Educ </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 8814046.86 </td>
   <td style="text-align:right;"> 8814046.86 </td>
   <td style="text-align:right;"> 34.142 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Exper </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2095479.05 </td>
   <td style="text-align:right;"> 2095479.05 </td>
   <td style="text-align:right;"> 8.117 </td>
   <td style="text-align:right;"> 0.005 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 9152264.30 </td>
   <td style="text-align:right;"> 9152264.30 </td>
   <td style="text-align:right;"> 35.452 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Residuals </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 22459574.96 </td>
   <td style="text-align:right;"> 258156.03 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>

---

## Before next class

- [Reading 04](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-04.html)