Model Assessment

class: center, middle, inverse, title-slide

# Model Assessment
### Dr. Maria Tackett
### 02.13.19

---

## Announcements

- Lab 04 due today

- HW 03 due Monday, Feb 18

---

## R packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(cowplot) # use plot_grid function
```

---

## Restaurant tips

What affects the amount customers tip at a restaurant?

- **Response:**
    - <font class="vocab">`Tip`</font>: amount of the tip
    
- **Predictors:**
    - <font class="vocab">`Party`</font>: number of people in the party
    - <font class="vocab">`Meal`</font>:  time of day (Lunch, Dinner, Late Night) 
    - <font class="vocab">`Age`</font>: age category of person paying the bill (Yadult, Middle, SenCit)

```r
tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party))
```

---

## ANOVA table for regression

We can use the Analysis of Variance (ANOVA) table to decompose the variability in our response variable

|  | Sum of Squares | DF | Mean Square | F-Stat| p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `$$\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$` | `$$p$$` | `$$\frac{MSS}{p}$$` | `$$\frac{MMS}{RMS}$$` | `$$P(F > \text{F-Stat})$$` |
| Residual | `$$\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$` | `$$n-p-1$$` | `$$\frac{RSS}{n-p-1}$$` |  |  |
| Total | `$$\sum\limits_{i=1}^{n}(y_i - \bar{y})^2$$` | `$$n-1$$` | `$$\frac{TSS}{n-1}$$` |  |  |

The estimate of the regression variance, `$\hat{\sigma}^2 = RMS$`

---

## `$R^2$`

- **Recall**: `$R^2$` is the proportion of the variation in the response variable explained by the regression model
<br>

- `$R^2$` will always increase as we add more variables to the model 
  + If we add enough variables, we can always achieve `$R^2=100\%$`
<br>

- If we only use `$R^2$` to choose a best fit model, we will be prone to choose the model with the most predictor variables

---

## Adjusted `$R^2$`

- <font class="vocab">Adjusted `$R^2$`</font>: a version of `$R^2$` that penalizes for unnecessary predictor variables
<br>

- Similar to `$R^2$`, it measures the proportion of variation in the response that is explained by the regression model 
<br>

- Differs from `$R^2$` by using the mean squares rather than sums of squares and therefore adjusting for the number of predictor variables

---

## `$R^2$` and Adjusted `$R^2$`

`$$R^2 = \frac{\text{Total Sum of Squares} - \text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$`
<br>

.alert[
`$$Adj. R^2 = \frac{\text{Total Mean Square} - \text{Residual Mean Square}}{\text{Total Mean Square}}$$`
]
<br>

- `$Adj. R^2$` can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!

- Use `$R^2$` when describing the relationship between the response and predictor variables

---

### Restaurant tips: model

```r
model1 <- lm(Tip ~ Party + Meal + Age , data = tips)
kable(tidy(model1),format="html",digits=3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.254 </td>
   <td style="text-align:right;"> 0.394 </td>
   <td style="text-align:right;"> 3.182 </td>
   <td style="text-align:right;"> 0.002 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Party </td>
   <td style="text-align:right;"> 1.808 </td>
   <td style="text-align:right;"> 0.121 </td>
   <td style="text-align:right;"> 14.909 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLate Night </td>
   <td style="text-align:right;"> -1.632 </td>
   <td style="text-align:right;"> 0.407 </td>
   <td style="text-align:right;"> -4.013 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLunch </td>
   <td style="text-align:right;"> -0.612 </td>
   <td style="text-align:right;"> 0.402 </td>
   <td style="text-align:right;"> -1.523 </td>
   <td style="text-align:right;"> 0.130 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeSenCit </td>
   <td style="text-align:right;"> 0.390 </td>
   <td style="text-align:right;"> 0.394 </td>
   <td style="text-align:right;"> 0.990 </td>
   <td style="text-align:right;"> 0.324 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeYadult </td>
   <td style="text-align:right;"> -0.505 </td>
   <td style="text-align:right;"> 0.412 </td>
   <td style="text-align:right;"> -1.227 </td>
   <td style="text-align:right;"> 0.222 </td>
  </tr>
</tbody>
</table>

---

## Restaurant tips: ANOVA

- <font class="vocab">R output</font>

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Sum Sq </th>
   <th style="text-align:right;"> Mean Sq </th>
   <th style="text-align:right;"> F value </th>
   <th style="text-align:right;"> Pr(&gt;F) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Party </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1188.636 </td>
   <td style="text-align:right;"> 1188.636 </td>
   <td style="text-align:right;"> 311.002 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Meal </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 88.460 </td>
   <td style="text-align:right;"> 44.230 </td>
   <td style="text-align:right;"> 11.573 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 13.032 </td>
   <td style="text-align:right;"> 6.516 </td>
   <td style="text-align:right;"> 1.705 </td>
   <td style="text-align:right;"> 0.185 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Residuals </td>
   <td style="text-align:right;"> 163 </td>
   <td style="text-align:right;"> 622.979 </td>
   <td style="text-align:right;"> 3.822 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>

- <font class="vocab">ANOVA table</font>

---

### Calculating `$R^2$` and Adj `$R^2$`

```r
#r-squared
tss <- 1188.63588 + 88.46005 + 13.03236 + 622.97932	
rss <- 622.97932	
(r_sq <- (tss - rss)/tss)
```

```
## [1] 0.6743626
```

```r
#adj r-squared
tms <- tss/(nrow(tips)-1)
rms <- 3.821959	
(adj_r_sq <- (tms - rms)/tms)
```

```
## [1] 0.6643738
```

---

### Restaurant tips: `$R^2$` and Adj. `$R^2$`

```r
glance(model1)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.674         0.664  1.95      67.5 6.14e-38     6  -350.  714.  736.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```
<br>

- Close values of `$R^2$` and Adjusted `$R^2$` indicate that the variables in the model are significant in understanding variation in tips

---

## ANOVA F Test

- Using the ANOVA table, we can test whether any variable in the model is a significant predictor of the response. We conduct this test using the following hypotheses:

.alert[
`$$\begin{aligned}&H_0: \beta_{1} =  \beta_{2} = \dots = \beta_p = 0 \\ 
&H_a: \text{at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$`
]

<br>

- The statistic for this test is the `$F$` test statistic in the ANOVA table

- We calculate the p-value using an `$F$` distribution with `$p$` and `$(n-p-1)$` degrees of freedom

---

## ANOVA F Test in R

```r
model0 <- lm(Tip ~ 1, data=tips)
```

```r
model1 <- lm(Tip ~ Party + Meal + Age , data = tips)
```

```r
kable(anova(model0,model1),format="html")
```

**At least one coefficient is non-zero, i.e. at least one predictor in the model is significant**
---

### Testing subset of coefficients

- Sometimes we want to test whether a subset of coefficients are all equal to 0

- This is often the case when we want test 
    - whether a categorical variable with `$k$` levels is a significant predictor of the response
    - whether the interaction between a categorical and quantitative variable is significant

- To do so, we will use the  <font class="vocab3">Nested F Test </font>

---

## Nested F Test

- Suppose we have a full and reduced model:

`$$\begin{aligned}&\text{Full}: y = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q + \beta_{q+1} x_{q+1} + \dots \beta_p x_p \\
&\text{Red}: y = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q\end{aligned}$$`
<br>

- We want to test whether any of the variables `$x_{q+1}, x_{q+2}, \ldots, x_p$` are significant predictors. To do so, we will test the hypothesis:

.alert[
`$$\begin{aligned}&H_0: \beta_{q+1} =  \beta_{q+2} = \dots = \beta_p = 0 \\ 
&H_a: \text{at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$`
]

---

## Nested F Test

- The test statistic for this test is

`$$F = \frac{(RSS_{reduced} - RSS_{full})\big/(p_{full} - p_{reduced})}{RSS_{full}\big/(n-p_{full}-1)}$$`
<br>

- Calculate the p-value using the F distribution with `$(p_{full} - p_{reduced})$` and `$(n-p_{full}-1)$` degrees of freedom

---

### Is `Meal` a significant predictor of tips?

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.254 </td>
   <td style="text-align:right;"> 0.394 </td>
   <td style="text-align:right;"> 3.182 </td>
   <td style="text-align:right;"> 0.002 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Party </td>
   <td style="text-align:right;"> 1.808 </td>
   <td style="text-align:right;"> 0.121 </td>
   <td style="text-align:right;"> 14.909 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeSenCit </td>
   <td style="text-align:right;"> 0.390 </td>
   <td style="text-align:right;"> 0.394 </td>
   <td style="text-align:right;"> 0.990 </td>
   <td style="text-align:right;"> 0.324 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeYadult </td>
   <td style="text-align:right;"> -0.505 </td>
   <td style="text-align:right;"> 0.412 </td>
   <td style="text-align:right;"> -1.227 </td>
   <td style="text-align:right;"> 0.222 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLate Night </td>
   <td style="text-align:right;"> -1.632 </td>
   <td style="text-align:right;"> 0.407 </td>
   <td style="text-align:right;"> -4.013 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLunch </td>
   <td style="text-align:right;"> -0.612 </td>
   <td style="text-align:right;"> 0.402 </td>
   <td style="text-align:right;"> -1.523 </td>
   <td style="text-align:right;"> 0.130 </td>
  </tr>
</tbody>
</table>

---

### Tips data:  Nested F Test

`$$\begin{aligned}&H_0: \beta_{late night} = \beta_{lunch} = 0\\
&H_a: \text{ at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$`

```r
reduced <- lm(Tip ~ Party + Age, data = tips)
```

```r
full <- lm(Tip ~ Party + Age + Meal, data = tips)
```

```r
kable(anova(full,reduced),format="html")
```

**At least one coefficient associated with `Meal` is not zero. Therefore, `Meal` is a significant predictor of `Tips`.**

---

class: middle

.question[
Why is it not good practice to use the individual p-values to determine a categorical variable with `$k > 2$` levels) is significant? *Hint*: What does it actually mean if none of the `$k-1$` p-values are significant?
]

---

## Practice with Interactions

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.2764989 </td>
   <td style="text-align:right;"> 0.4910882 </td>
   <td style="text-align:right;"> 2.5993270 </td>
   <td style="text-align:right;"> 0.0102086 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Party </td>
   <td style="text-align:right;"> 1.7947980 </td>
   <td style="text-align:right;"> 0.1715003 </td>
   <td style="text-align:right;"> 10.4652753 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeSenCit </td>
   <td style="text-align:right;"> 0.4007889 </td>
   <td style="text-align:right;"> 0.3969295 </td>
   <td style="text-align:right;"> 1.0097230 </td>
   <td style="text-align:right;"> 0.3141431 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AgeYadult </td>
   <td style="text-align:right;"> -0.4701634 </td>
   <td style="text-align:right;"> 0.4197146 </td>
   <td style="text-align:right;"> -1.1201978 </td>
   <td style="text-align:right;"> 0.2642977 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLate Night </td>
   <td style="text-align:right;"> -1.8454674 </td>
   <td style="text-align:right;"> 0.7089728 </td>
   <td style="text-align:right;"> -2.6030159 </td>
   <td style="text-align:right;"> 0.0101039 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> MealLunch </td>
   <td style="text-align:right;"> -0.4608832 </td>
   <td style="text-align:right;"> 0.8651044 </td>
   <td style="text-align:right;"> -0.5327487 </td>
   <td style="text-align:right;"> 0.5949421 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Party:MealLate Night </td>
   <td style="text-align:right;"> 0.1108600 </td>
   <td style="text-align:right;"> 0.2846584 </td>
   <td style="text-align:right;"> 0.3894491 </td>
   <td style="text-align:right;"> 0.6974586 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Party:MealLunch </td>
   <td style="text-align:right;"> -0.0500822 </td>
   <td style="text-align:right;"> 0.2825586 </td>
   <td style="text-align:right;"> -0.1772455 </td>
   <td style="text-align:right;"> 0.8595384 </td>
  </tr>
</tbody>
</table>

.question[
1. Write the general form of the model. 
2. Write the model for `Meal == "Late Night"`. 
3. How does the mean change when `Meal == "Late Night"`? 
4. How does the slope of `Party` change when `Meal == "Late Night"`?
]
---

### Nested F test for interactions

**Is the interaction between `Party` and `Meal` significant?**

```r
reduced <- lm(Tip ~ Party + Age + Meal, data = tips)
```

```r
full <- lm(Tip ~ Party + Age + Meal + Meal*Party, data = tips)
```

```r
kable(anova(full,reduced),format="html")
```