class: center, middle, inverse, title-slide # Multiple Linear Regression ## Special Predictors ### Dr. Maria Tackett ### 02.06.19 --- ## Announcements - Lab 03 due today - HW 02 due Mon, Feb 11 - Policy on [regrade requests](https://www2.stat.duke.edu/courses/Spring19/sta210.001/policies.html) --- ## R packages ```r library(tidyverse) library(knitr) library(broom) library(Sleuth3) # case 1202 dataset library(cowplot) # use plot_grid function ``` --- ## Starting wages data **Explanatory** - <font class="vocab">`Educ`: </font>years of Education - <font class="vocab">`Exper`: </font>months of previous work Experience (before hire at bank) - <font class="vocab">`Female`: </font>1 if female, 0 if male - <font class="vocab">`Senior`: </font>months worked at bank since hire - <font class="vocab">`Age`: </font>Age in months **Response** - <font class="vocab">`Bsal`: </font>annual salary at time of hire --- ## Starting wages ```r wages <- case1202 %>% mutate(Female = ifelse(Sex=="Female",1,0)) %>% select(-Sal77,-Sex) ``` ```r glimpse(wages) ``` ``` ## Observations: 93 ## Variables: 6 ## $ Bsal <int> 5040, 6300, 6000, 6000, 6000, 6840, 8100, 6000, 6000, 690… ## $ Senior <int> 96, 82, 67, 97, 66, 92, 66, 82, 88, 75, 89, 91, 66, 86, 9… ## $ Age <int> 329, 357, 315, 354, 351, 374, 369, 363, 555, 416, 481, 33… ## $ Educ <int> 15, 15, 15, 12, 12, 15, 16, 12, 12, 15, 12, 15, 15, 15, 1… ## $ Exper <dbl> 14.0, 72.0, 35.5, 24.0, 56.0, 41.5, 54.5, 32.0, 252.0, 13… ## $ Female <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, … ``` --- ## Regression model ```r model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, data=wages) kable(tidy(model,conf.int=TRUE),format="html",digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low 
</th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 4981.434 </td> <td style="text-align:right;"> 7574.353 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -33.108 </td> <td style="text-align:right;"> -12.056 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> <td style="text-align:right;"> -0.801 </td> <td style="text-align:right;"> 2.063 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 42.887 </td> <td style="text-align:right;"> 141.725 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> <td style="text-align:right;"> -1.597 </td> <td style="text-align:right;"> 2.598 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1024.255 </td> <td style="text-align:right;"> -511.571 </td> </tr> </tbody> 
</table>

---

class: middle, center

## Math Details

---

## Regression Model

- The multiple linear regression model assumes
`$$\color{blue}{y|x_1, x_2, \ldots, x_p \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p, \sigma^2)}$$`
<br>

--

- For a given observation `\((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\)`, we can rewrite the previous statement as
`$$\color{blue}{y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)}$$`

---

## Estimating `\(\sigma^2\)`

- For a given observation `\((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\)`, the residual is
`$$\color{blue}{e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})}$$`

--

- The estimated value of the regression variance, `\(\sigma^2\)`, is
`$$\color{blue}{\hat{\sigma}^2 = \frac{RSS}{n-p-1} = \frac{\sum_{i=1}^n e_i^2}{n-p-1}}$$`

---

## Estimating Coefficients

- One way to estimate the coefficients is by taking partial derivatives of the formula
`$$\color{blue}{\sum_{i=1}^n e_i^2 = \sum_{i=1}^{n}[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip})]^2}$$`

--

- This produces messy formulas, but we can represent the regression equation using matrix notation

---

## Matrix Notation

- `\(\mathbf{Y}\)`: `\(n \times 1\)` vector of responses
- `\(\mathbf{X}\)`: `\(n \times (p+1)\)` matrix of intercept and predictor variables
- `\(\boldsymbol{\beta}\)`: `\((p+1) \times 1\)` vector of coefficients

`$$\mathbf{Y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots\\ y_{n} \end{bmatrix} \hspace{15mm} \mathbf{X} = \begin{bmatrix} 1& x_{11} & x_{12} & \dots & x_{1p}\\ 1&x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1&x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix}\hspace{15mm}\boldsymbol{\beta}=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_p \end{bmatrix}$$`

<br>

--

.alert[
`$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$`
]

---

## Estimate `\(\beta_j\)`'s

.alert[
`$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$`
]

```r
# Estimate coefficients
y <- wages %>% select(Bsal) %>% as.matrix()
predictors <- wages %>% select(Senior, Age, Educ, Exper, Female)
intercept <- data.frame(intercept = rep(1, nrow(wages)))
x <- bind_cols(intercept, predictors) %>% as.matrix()
beta <- solve(t(x) %*% x) %*% t(x) %*% y
beta
```

```
##                   Bsal
## intercept 6277.8933861
## Senior     -22.5823029
## Age          0.6309603
## Educ        92.3060229
## Exper        0.5006397
## Female    -767.9126888
```

---

## Estimate `\(\sigma^2\)`

.alert[
`$$\hat{\sigma}^2 ~ = ~ \frac{\mathbf{e}^{T}\mathbf{e}}{n-p-1} ~ = ~ \frac{(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{T}(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n-p-1}$$`
]

```r
# Estimate sigma.sq
e <- y - x %*% beta # residuals
sigma.sq <- (t(e) %*% e) / (nrow(wages) - ncol(x))
(sigma.sq <- as.numeric(sigma.sq))
```

```
## [1] 258156
```

---

## Calculate `\(SE(\hat{\beta}_j)\)`

.alert[
`$$Var(\boldsymbol{\hat{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\hat{\sigma}^2$$`
]

Take the square root of the diagonal elements to get `\(SE(\hat{\beta}_j)\)`

```r
var.beta <- solve(t(x) %*% x) * sigma.sq
SE <- sqrt(diag(var.beta))
SE
```

```
##   intercept      Senior         Age        Educ       Exper      Female 
## 652.2713190   5.2957316   0.7206541  24.8635404   1.0552624 128.9700022
```

---

class: middle

See [Matrix Form of linear regression notes](https://github.com/STA210-Sp19/supplemental-notes/blob/master/regression-basics-matrix.pdf) for more details about the matrix form of linear regression.
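---

## Checking the matrix formulas

As a sanity check, the closed-form estimates above should agree with `lm()`. The sketch below re-derives `beta` and `SE` so the chunk runs on its own (only the data prep from earlier slides and base R are used; the comparison itself is new):

```r
library(dplyr)
library(Sleuth3)

# Rebuild the wages data as on the earlier slide
wages <- case1202 %>%
  mutate(Female = ifelse(Sex == "Female", 1, 0)) %>%
  select(-Sal77, -Sex)

# Design matrix with an intercept column, and the response
x <- cbind(intercept = 1,
           as.matrix(wages[, c("Senior", "Age", "Educ", "Exper", "Female")]))
y <- wages$Bsal

beta <- solve(t(x) %*% x) %*% t(x) %*% y                 # (X'X)^{-1} X'Y
e <- y - x %*% beta                                      # residuals
sigma.sq <- as.numeric(t(e) %*% e) / (nrow(x) - ncol(x)) # RSS / (n - p - 1)
SE <- sqrt(diag(solve(t(x) %*% x) * sigma.sq))           # SE of each coefficient

# Compare with lm() (unname() because the intercept labels differ)
model <- lm(Bsal ~ Senior + Age + Educ + Exper + Female, data = wages)
all.equal(unname(beta[, 1]), unname(coef(model)))
all.equal(unname(SE), unname(summary(model)$coefficients[, "Std. Error"]))
```

Both comparisons should return `TRUE`.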
--- class: middle, center ## Special Predictors --- ## Interpreting the Intercept <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6277.893 </td> <td style="text-align:right;"> 652.271 </td> <td style="text-align:right;"> 9.625 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Senior </td> <td style="text-align:right;"> -22.582 </td> <td style="text-align:right;"> 5.296 </td> <td style="text-align:right;"> -4.264 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 0.721 </td> <td style="text-align:right;"> 0.876 </td> <td style="text-align:right;"> 0.384 </td> </tr> <tr> <td style="text-align:left;"> Educ </td> <td style="text-align:right;"> 92.306 </td> <td style="text-align:right;"> 24.864 </td> <td style="text-align:right;"> 3.713 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Exper </td> <td style="text-align:right;"> 0.501 </td> <td style="text-align:right;"> 1.055 </td> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.636 </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> -767.913 </td> <td style="text-align:right;"> 128.970 </td> <td style="text-align:right;"> -5.954 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> .question[ - Interpret the intercept. - Is this interpretation meaningful? Why or why not? 
]

---

### Mean-Centered Variables

- To aid with the interpretation of the intercept, you can subtract the mean value from each continuous predictor variable (i.e. center it)
- Use these <font class="vocab3">mean-centered</font> variables in the regression model
- Now the intercept is interpreted as the expected (i.e. mean) value of the response at the average value of the predictor variables

---

### Salary: Mean-Centered Variables

```r
wages <- wages %>%
  mutate(SeniorCent = Senior - mean(Senior),
         AgeCent = Age - mean(Age),
         EducCent = Educ - mean(Educ),
         ExperCent = Exper - mean(Exper))
```

.pull-left[
<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

### Salary: Mean-Centered Variables

.question[
Calculate the regression model using the mean-centered variables. How did the model change?
]

---

## Quadratic Terms

- Sometimes the response variable may have a quadratic relationship with one or more predictor variables
- You can see this in a plot of the residuals vs. a predictor variable
- Include quadratic terms in the model to capture the relationship
- <font class="vocab">Good Practice:</font> Also include all lower-order terms; this helps with interpretation
- You can show quadratic relationships by plotting the predicted mean response for different values of the explanatory variable
- The same applies for higher-order polynomial relationships

---

Below are plots of the residuals versus each quantitative predictor variable.

<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

.question[
Which variables (if any) appear to have a quadratic relationship with `Bsal`?
]

---

## Indicator (dummy) variables

- Suppose there is a categorical variable with `\(k\)` levels (categories)
- Make `\(k\)` indicator (dummy) variables
- Use `\(k-1\)` of the indicator variables in the model
- We can't uniquely estimate all `\(k\)` coefficients at once if the intercept is in the model
- The level that doesn't have a variable in the model is called the <font class="vocab3">baseline</font>
- Each coefficient is interpreted as the change in the mean response relative to the baseline

---

## Indicator variables when `\(k=2\)`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 5924.007 </td>
   <td style="text-align:right;"> 99.659 </td>
   <td style="text-align:right;"> 59.443 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 5725.925 </td>
   <td style="text-align:right;"> 6122.090 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> -767.913 </td>
   <td style="text-align:right;"> 128.970 </td>
   <td style="text-align:right;"> -5.954 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1024.255 </td>
   <td style="text-align:right;"> -511.571 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SeniorCent </td>
   <td style="text-align:right;"> -22.582 </td>
   <td style="text-align:right;"> 5.296 </td>
   <td style="text-align:right;"> -4.264 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -33.108 </td>
   <td style="text-align:right;"> -12.056 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeCent </td>
   <td style="text-align:right;"> 0.631 </td>
   <td style="text-align:right;"> 0.721 </td>
   <td style="text-align:right;"> 0.876 </td>
   <td style="text-align:right;"> 0.384 </td>
   <td style="text-align:right;"> -0.801 </td>
   <td style="text-align:right;"> 2.063 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> EducCent </td>
   <td style="text-align:right;"> 92.306 </td>
   <td style="text-align:right;"> 24.864 </td>
   <td style="text-align:right;"> 3.713 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 42.887 </td>
   <td style="text-align:right;"> 141.725 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ExperCent </td>
   <td style="text-align:right;"> 0.501 </td>
   <td style="text-align:right;"> 1.055 </td>
   <td style="text-align:right;"> 0.474 </td>
   <td style="text-align:right;"> 0.636 </td>
   <td style="text-align:right;"> -1.597 </td>
   <td style="text-align:right;"> 2.598 </td>
  </tr>
</tbody>
</table>

.question[
- What is the intercept of the model for males?
- What is the intercept of the model for females?
]

---

## Indicator variables when `\(k > 2\)`

Build a regression model with Education treated as a categorical variable.

.question[
- What is the baseline for Education?
- Interpret the coefficient for `EducCat16`.
- What is your conclusion from the p-value of `EducCat12`?
- What is your conclusion from the p-value of `EducCat15`?
]

---

## Interaction Terms

- **Case**: The relationship of a predictor variable with the response depends on the value of another predictor variable
    + This is an <font class="vocab3">interaction effect</font>
- Create a new interaction variable that is the product of the two interacting predictor variables
- <font class="vocab">Good Practice:</font> When including an interaction term, also include each variable on its own (called a **main effect**)

---

## Interaction effects

<img src="07-mlr-special-predictors_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

.question[
Do you think there is a significant interaction effect that includes gender and seniority? Why or why not?
] --- ## Before next class - [Reading 05](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-05.html)
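---

## Appendix: code sketch for special predictors

One way the indicator-variable and interaction models discussed above could be fit. This is a sketch, not the code used in class: the names `EducCat`, `educ_model`, and `int_model` are assumptions, and `EducCat` is created here with `factor()`.

```r
library(dplyr)
library(Sleuth3)

wages <- case1202 %>%
  mutate(Female = ifelse(Sex == "Female", 1, 0),
         EducCat = factor(Educ)) %>%  # Educ treated as categorical
  select(-Sal77, -Sex)

# lm() creates the k - 1 indicator variables automatically;
# the lowest education level becomes the baseline
educ_model <- lm(Bsal ~ EducCat, data = wages)

# Interaction: Senior * Female expands to both main effects
# plus the product term Senior:Female
int_model <- lm(Bsal ~ Senior * Female, data = wages)
summary(int_model)
```

With the intercept in the model, `educ_model` has one coefficient for the baseline plus `\(k-1\)` indicator coefficients, and `int_model` contains the `Senior:Female` interaction term alongside the two main effects.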