class: center, middle, inverse, title-slide

# Analysis of Variance

### Dr. Maria Tackett

### 01.30.19

---

class: regular

## Announcements

- HW 01 and Lab 02 due **today at 11:59p**

- Make sure all files are updated to the final version on GitHub

---

## R Packages

```r
library(tidyverse)
library(broom)
library(knitr)
```

---

## Beer Data

```r
beer <- read_csv("data/beer.csv") %>%
  filter(!is.na(Carbohydrates), !is.na(PercentAlcohol))
```

- We remove observations with missing data, so the dataset contains information about 159 domestic beers
- `Brand`
- `Brewery`
- `PercentAlcohol`
- `CaloriesPer12Oz`
- `Carbohydrates` (in grams)

---

### `Carbohydrates` vs. `PercentAlcohol`

```r
model <- lm(Carbohydrates ~ PercentAlcohol, data = beer)
kable(model %>% tidy(conf.int = TRUE), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.642 </td> <td style="text-align:right;"> 1.311 </td> <td style="text-align:right;"> 2.015 </td> <td style="text-align:right;"> 0.046 </td> <td style="text-align:right;"> 0.052 </td> <td style="text-align:right;"> 5.231 </td> </tr>
  <tr> <td style="text-align:left;"> PercentAlcohol </td> <td style="text-align:right;"> 1.799 </td> <td style="text-align:right;"> 0.242 </td> <td style="text-align:right;"> 7.443 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 1.322 </td> <td style="text-align:right;"> 2.277 </td> </tr>
</tbody>
</table>

---

### Distribution of `Carbohydrates`

```r
ggplot(data = beer, mapping = aes(x = Carbohydrates)) +
  geom_histogram()
```

<img src="05-anova_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />

**Goal:** To understand the variation in `Carbohydrates`

---

## ANOVA

The <font class="vocab">Analysis of Variance (ANOVA)</font> approach to regression is used to answer the following:

- How much of the variation in the response variable can be attributed to the regression model?
- How much of the variation in the response can be attributed to factors not included in the model?

<br><br>

--

| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `\(\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` | `\(1\)` | `\(MSS/1\)` | `\(\frac{MSS/1}{RSS/(n-2)}\)` | `\(P(F > \text{F-Stat})\)` |
| Residual | `\(\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2\)` | `\(n-2\)` | `\(RSS/(n-2)\)` | | |
| Total | `\(\sum\limits_{i=1}^{n}(y_i - \bar{y})^2\)` | `\(n-1\)` | `\(TSS/(n-1)\)` | | |

---

## Total variation

The total amount of variation in the `\(n\)` observed values of `\(y\)` is the <font class="vocab">Total Sum of Squares</font>. You can think of this as the variation about the sample mean, `\(\bar{y}\)`.

`$$TSS = \sum\limits_{i=1}^n (y_i - \bar{y})^2 = (n-1)s_y^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-3-1.png" width="65%" style="display: block; margin: auto;" />
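---

### Computing TSS directly

The TSS formula can be verified numerically. Below is a minimal sketch (not part of the original slides) that assumes the `beer` data frame loaded earlier:

```r
# Total sum of squares: squared deviations of the response about its mean
y <- beer$Carbohydrates
TSS <- sum((y - mean(y))^2)

# Equivalent form: (n - 1) times the sample variance of y
all.equal(TSS, (length(y) - 1) * var(y))
```

The two expressions agree because the sample variance `\(s_y^2\)` is defined with an `\(n-1\)` denominator.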
---

## Regression variation

The variation of the predicted values about the overall mean is the <font class="vocab">Regression (Model) Sum of Squares</font>. You can think of this as how much information about `\(y\)` is explained by its relationship with `\(x\)`.

`$$MSS = \sum\limits_{i=1}^n (\hat{y}_i - \bar{y})^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-4-1.png" width="65%" style="display: block; margin: auto;" />

---

## Residual variation

The amount of variation in the responses for a given value of `\(x\)` is the <font class="vocab">Residual Sum of Squares</font>. You can think of this as the variation in the response values about the line if you know the value of `\(x\)`.

`$$RSS = \sum\limits_{i=1}^n (y_i- \hat{y}_i)^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-5-1.png" width="65%" style="display: block; margin: auto;" />

---

## `\(R^2\)`

- We have already used the sums of squares in the ANOVA table to calculate `\(R^2\)`, the proportion of variation in the response that is explained by variation in the predictor variable, i.e. the regression line. Since `\(TSS = MSS + RSS\)`, this is equivalently `\(MSS/TSS\)`.

<br><br>

.instructions[
`$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$`
]

---

## Mean square

The <font class="vocab">mean square (MS)</font> is calculated by dividing a sum of squares by its associated degrees of freedom.

<br><br>

`\(\text{Total MS} = \frac{\text{TSS}}{n-1} = s_y^2\)`
<br><br>
`\(\text{Regression (Model) MS} = \frac{\text{MSS}}{1} = R^2(n-1)s_y^2\)`
<br><br>
`\(\text{Residual MS} = \frac{\text{RSS}}{n-2} = \hat{\sigma}^2\)`

---

## Test for `\(\beta_1\)`

- We want to test whether there is a significant linear relationship between the response and predictor variable:

`$$\begin{aligned}&H_0: \beta_1 = 0\\ &H_a: \beta_1 \neq 0\\\end{aligned}$$`

- In the previous class, we tested these hypotheses using inference about the slope

- Today we will test the hypotheses using analysis of variance

- This approach will be especially useful for multiple linear regression

---

## Test statistic

- The <font class="vocab3">test statistic</font> for ANOVA is

`$$\text{F-statistic} = \frac{\text{Regression (Model) MS}}{\text{Residual MS}} = \frac{\text{MSS}/1}{\text{RSS}/(n-2)}$$`

<br>

- **Main idea:** If there is a significant linear relationship between `\(x\)` and `\(y\)`, the Regression MS will be much larger than the Residual MS. Therefore, a large F-statistic is evidence against `\(H_0\)`.

---

## p-value

- Under `\(H_0\)`, the test statistic follows an `\(F\)` distribution with `\(1\)` and `\(n-2\)` degrees of freedom

- Large F-statistics are evidence against `\(H_0\)`, so the test is right-tailed, and to calculate the p-value we use

`$$\text{p-value} = P(F \geq \text{F-statistic})$$`

<img src="05-anova_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />
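---

### Computing the F-statistic directly

As a check on the formulas above, here is a minimal sketch (not part of the original slides) that computes the F-statistic and p-value from the fitted `model` for the beer data:

```r
# Sums of squares from the fitted simple linear regression
y <- beer$Carbohydrates
MSS <- sum((fitted(model) - mean(y))^2)  # regression (model) sum of squares
RSS <- sum(residuals(model)^2)           # residual sum of squares
n <- nrow(beer)

# F-statistic: model mean square over residual mean square
f_stat <- (MSS / 1) / (RSS / (n - 2))

# p-value: upper tail of the F distribution with 1 and n - 2 df
pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
```

These values should match the `statistic` and `p.value` columns of the `aov()` output for the beer data shown in a moment.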
---

## ANOVA Table

| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `\(\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` | `\(1\)` | `\(MSS/1\)` | `\(\frac{MSS/1}{RSS/(n-2)}\)` | `\(P(F > \text{F-Stat})\)` |
| Residual | `\(\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2\)` | `\(n-2\)` | `\(RSS/(n-2)\)` | | |
| Total | `\(\sum\limits_{i=1}^{n}(y_i - \bar{y})^2\)` | `\(n-1\)` | `\(TSS/(n-1)\)` | | |

---

### ANOVA for beer data

```r
kable(tidy(aov(model)), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> sumsq </th> <th style="text-align:right;"> meansq </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> PercentAlcohol </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1009.861 </td> <td style="text-align:right;"> 1009.861 </td> <td style="text-align:right;"> 55.397 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 157 </td> <td style="text-align:right;"> 2862.021 </td> <td style="text-align:right;"> 18.229 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr>
</tbody>
</table>

<br>

.question[
- What are the total degrees of freedom?
- What is `\(\hat{\sigma}^2\)`, the estimated regression variance?
- Calculate and interpret `\(R^2\)`.
]

---

class: middle, center

## Using ANOVA to compare group means

---

class: middle

Thus far, we have used a quantitative explanatory variable to understand the variation in a quantitative response variable.

Now, we will use a **categorical (qualitative) explanatory variable** to understand the variation in a quantitative response variable.

---

## Carbohydrates vs. Brewery

- Is there a relationship between the brewery and the total carbohydrates in beers from the top 5 domestic breweries?

```r
top5_brewery <- c("MillerCoors", "Flying Dog Brewery", "Anheuser Busch",
                  "Sierra Nevada", "Budweiser")

beer <- beer %>%
  filter(Brewery %in% top5_brewery)
```

---

## Carbohydrates vs. Brewery

```r
ggplot(data = beer, mapping = aes(x = Brewery, y = Carbohydrates)) +
  geom_boxplot()
```

<img src="05-anova_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" />

---

## Notation

- `\(K\)` is the number of distinct groups. We index the groups as `\(k = 1, \dots, K\)`.

<br>

--

- `\(n_k\)` is the number of observations in group `\(k\)`

<br>

--

- `\(n = n_1 + n_2 + \dots + n_K\)` is the total number of observations

<br>

--

- `\(y_{kj}\)` is the `\(j^{th}\)` observation in group `\(k\)`, for all `\(k, j\)`

<br>

--

- `\(\mu_k\)` is the population mean for group `\(k\)`, for `\(k = 1, \dots, K\)`
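---

### Group sizes and group means in R

To connect this notation to the data, here is a minimal sketch (not part of the original slides) that computes `\(n_k\)` and `\(\bar{y}_k\)` for each of the top 5 breweries:

```r
# Number of observations and mean carbohydrates per brewery
beer %>%
  group_by(Brewery) %>%
  summarise(n_k = n(),
            mean_carb = mean(Carbohydrates))
```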
---

## Set Up

- Is there a significant relationship between the explanatory variable `\(x\)` and the response variable `\(y\)`?

- <font class="vocab">Goal: </font> Test whether the group means are equal

--

- The ANOVA model assumes

`$$y_{kj} = \mu_{k} + \epsilon_{kj}$$`

such that `\(\epsilon_{kj}\)` is the amount `\(y_{kj}\)` deviates from `\(\mu_k\)`

--

- <font class="vocab">Assumption: </font> `\(\epsilon_{kj}\)` follows a Normal distribution with mean 0 and constant variance `\(\sigma^2\)`, i.e. `\(\epsilon_{kj} \sim N(0,\sigma^2)\)`
    + Another viewpoint: `\(y_{kj} \sim N(\mu_k, \sigma^2)\)`

---

## Hypotheses

- <font class="vocab">General Question: </font> Is there a significant difference in the means across the `\(K\)` groups?

--

- In other words, we want to test the following hypotheses:

$$
`\begin{aligned}
&H_0: \mu_1 = \mu_2 = \dots = \mu_K\\
&H_a: \text{At least one }\mu_k \text{ is not equal to the others}
\end{aligned}`
$$

--

- If the sample means are "far apart," there is evidence against `\(H_0\)`

--

- We will calculate a test statistic to assess "far apart" in the context of the data

---

## ANOVA

- **Main Idea:** Decompose the <font color="green">total variation</font> in the data into the variation <font color="blue">between groups</font> and the variation <font color="red">within each group</font>

`$$\color{green}{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}- \bar{y})^2}=\color{blue}{\sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2} + \color{red}{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2}$$`

<br>

--

- If the variation <font color="blue">between groups</font> is significantly greater than the variation <font color="red">within each group</font>, then there is evidence against the null hypothesis.

---

## ANOVA table for comparing means

.small[
| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Between (Model) | `\(\sum\limits_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2\)` | `\(K-1\)` | `\(SSB/(K-1)\)` | `\(MSB/MSW\)` | `\(P(F > \text{F-Stat})\)` |
| Within (Residual) | `\(\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2\)` | `\(n-K\)` | `\(SSW/(n-K)\)` | | |
| Total | `\(\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{n_k}(y_{kj}-\bar{y})^2\)` | `\(n-1\)` | `\(SST/(n-1)\)` | | |
]

---

## Total

- Variation between and within groups

`$$SST =\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y})^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFT = n-1$$`

--

- Estimate of the variance across all observations:

`$$\frac{SST}{DFT} = \frac{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y})^2}{n-1}$$`

---

## Between (Model)

- Variation in the group means

`$$SSB = \sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFB = K-1$$`

--

- <font class="vocab">Mean Squares Between</font>

`$$MSB = \frac{SSB}{DFB} = \frac{\sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2}{K-1}$$`

--

- MSB estimates the variance across the `\(\mu_k\)`'s

---

class: regular

## Within (Residual)

- Variation within each group

`$$SSW=\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFW = n-K$$`

--

- <font class="vocab">Mean Squares Within</font>

`$$MSW = \frac{SSW}{DFW} = \frac{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2}{n-K}$$`

--

- MSW is the estimate of `\(\sigma^2\)`, the variance within each group
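---

### Computing SSB and SSW directly

Before looking at the `aov()` output, here is a minimal sketch (not part of the original slides) that builds the decomposition by hand for the filtered `beer` data:

```r
y_bar <- mean(beer$Carbohydrates)  # overall mean

group_stats <- beer %>%
  group_by(Brewery) %>%
  summarise(n_k = n(), y_bar_k = mean(Carbohydrates))

# Between-group sum of squares: group means about the overall mean
SSB <- sum(group_stats$n_k * (group_stats$y_bar_k - y_bar)^2)

# Within-group sum of squares: observations about their group means
SSW <- beer %>%
  group_by(Brewery) %>%
  mutate(dev_sq = (Carbohydrates - mean(Carbohydrates))^2) %>%
  pull(dev_sq) %>%
  sum()

# F-statistic with K - 1 and n - K degrees of freedom
K <- nrow(group_stats)
n <- nrow(beer)
f_stat <- (SSB / (K - 1)) / (SSW / (n - K))
```

These values should match the `sumsq`, `meansq`, and `statistic` columns in the ANOVA table on the next slide.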
---

## Carbohydrates vs. Brewery

```r
model_brewery <- lm(Carbohydrates ~ Brewery, data = beer)
kable(tidy(aov(model_brewery)), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> sumsq </th> <th style="text-align:right;"> meansq </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Brewery </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 962.055 </td> <td style="text-align:right;"> 240.514 </td> <td style="text-align:right;"> 14.02 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> 1509.696 </td> <td style="text-align:right;"> 17.156 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr>
</tbody>
</table>

.question[
- How many observations in the dataset are from the top 5 breweries?
- What is `\(\hat{\sigma}^2\)`, the estimated variance within each group?
- State the null and alternative hypotheses for this test. What is your conclusion?
]

---

## Assumptions for ANOVA

- <font class="vocab">Normality: </font> `\(y_{kj} \sim N(\mu_k, \sigma^2)\)`

- <font class="vocab">Independence: </font> There is independence within and across groups

- <font class="vocab">Constant Variance: </font> The population distribution for each group has a common variance, `\(\sigma^2\)`

- We can typically check these assumptions in the exploratory data analysis

---

## Before next class

- [Reading 03](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-03.html) due Monday

- Understand inference and ANOVA for simple linear regression!
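---

## Appendix: checking assumptions in R

A minimal sketch (not part of the original slides) of exploratory checks for the normality and constant-variance assumptions, using the filtered `beer` data:

```r
# Normality: look at the distribution of the response within each group
# (bin count chosen arbitrarily for illustration)
ggplot(data = beer, mapping = aes(x = Carbohydrates)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ Brewery)

# Constant variance: compare the spread of each group's distribution
beer %>%
  group_by(Brewery) %>%
  summarise(sd_carb = sd(Carbohydrates))
```

If the group standard deviations are roughly similar and each histogram is approximately symmetric, the assumptions are plausible.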