class: center, middle, inverse, title-slide

# Analysis of Variance

### Dr. Maria Tackett

### 01.30.19

---

class: regular

## Announcements

- HW 01 and Lab 02 due **today at 11:59p**

- Make sure all files are updated to the final version on GitHub

---

## R Packages

```r
library(tidyverse)
library(broom)
library(knitr)
```

---

## Beer Data

```r
beer <- read_csv("data/beer.csv") %>%
  filter(!is.na(Carbohydrates), !is.na(PercentAlcohol))
```

- We remove observations with missing data, so the dataset contains information about 159 domestic beers
- `Brand`
- `Brewery`
- `PercentAlcohol`
- `CaloriesPer12Oz`
- `Carbohydrates` (in grams)

---

### `Carbohydrates` vs. `PercentAlcohol`

```r
model <- lm(Carbohydrates ~ PercentAlcohol, data = beer)
kable(model %>% tidy(conf.int = TRUE), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.642 </td> <td style="text-align:right;"> 1.311 </td> <td style="text-align:right;"> 2.015 </td> <td style="text-align:right;"> 0.046 </td> <td style="text-align:right;"> 0.052 </td> <td style="text-align:right;"> 5.231 </td> </tr>
  <tr> <td style="text-align:left;"> PercentAlcohol </td> <td style="text-align:right;"> 1.799 </td> <td style="text-align:right;"> 0.242 </td> <td style="text-align:right;"> 7.443 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 1.322 </td> <td style="text-align:right;"> 2.277 </td> </tr>
</tbody>
</table>

---

### Distribution of `Carbohydrates`

```r
ggplot(data = beer, mapping = aes(x = Carbohydrates)) +
  geom_histogram()
```

<img src="05-anova_files/figure-html/unnamed-chunk-2-1.png" width="60%" style="display: block; margin: auto;" />

**Goal:** To understand the variation in `Carbohydrates`

---

## ANOVA

The <font class="vocab">Analysis of Variance (ANOVA)</font> approach to regression is used to answer the following:

- How much of the variation in the response variable can be attributed to the regression model?
- How much of the variation in the response can be attributed to factors not included in the model?

<br><br>

--

| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `\(\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` | `\(1\)` | `\(MSS/1\)` | `\(\frac{MSS/1}{RSS/(n-2)}\)` | `\(P(F > \text{F-Stat})\)` |
| Residual | `\(\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2\)` | `\(n-2\)` | `\(RSS/(n-2)\)` | | |
| Total | `\(\sum\limits_{i=1}^{n}(y_i - \bar{y})^2\)` | `\(n-1\)` | `\(TSS/(n-1)\)` | | |

---

## Total variation

The total amount of variation in the `\(n\)` observed values of `\(y\)` is the <font class="vocab">Total Sum of Squares</font>. You can think of this as the variation about the sample mean, `\(\bar{y}\)`.

`$$TSS = \sum\limits_{i=1}^n (y_i - \bar{y})^2 = (n-1)s_y^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-3-1.png" width="65%" style="display: block; margin: auto;" />
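---

### Computing TSS directly

The TSS formula can be verified numerically. Below is a minimal sketch (not part of the original slides) that assumes the `beer` data frame loaded earlier:

```r
# Total sum of squares: squared deviations of the response about its mean
y <- beer$Carbohydrates
TSS <- sum((y - mean(y))^2)

# Equivalent form: (n - 1) times the sample variance of y
all.equal(TSS, (length(y) - 1) * var(y))
```

The two expressions agree because the sample variance `\(s_y^2\)` is defined with an `\(n-1\)` denominator.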
---

## Regression variation

The variation of the predicted values about the overall mean is the <font class="vocab">Regression (Model) Sum of Squares</font>. You can think of this as how much information about `\(y\)` is explained by its relationship with `\(x\)`.

`$$MSS = \sum\limits_{i=1}^n (\hat{y}_i - \bar{y})^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-4-1.png" width="65%" style="display: block; margin: auto;" />

---

## Residual variation

The amount of variation in the responses for a given value of `\(x\)` is the <font class="vocab">Residual Sum of Squares</font>. You can think of this as the variation in the response values about the line if you know the value of `\(x\)`.

`$$RSS = \sum\limits_{i=1}^n (y_i- \hat{y}_i)^2$$`

<img src="05-anova_files/figure-html/unnamed-chunk-5-1.png" width="65%" style="display: block; margin: auto;" />

---

## `\(R^2\)`

- We have already used the sums of squares in the ANOVA table to calculate `\(R^2\)`, the proportion of variation in the response that is explained by variation in the predictor variable, i.e. the regression line. Since `\(TSS = MSS + RSS\)`, this is equivalently `\(MSS/TSS\)`.

<br><br>

.instructions[
`$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$`
]

---

## Mean square

The <font class="vocab">mean square (MS)</font> is calculated by dividing a sum of squares by its associated degrees of freedom.

<br><br>

`\(\text{Total MS} = \frac{\text{TSS}}{n-1} = s_y^2\)`
<br><br>
`\(\text{Regression (Model) MS} = \frac{\text{MSS}}{1} = R^2(n-1)s_y^2\)`
<br><br>
`\(\text{Residual MS} = \frac{\text{RSS}}{n-2} = \hat{\sigma}^2\)`

---

## Test for `\(\beta_1\)`

- We want to test whether there is a significant linear relationship between the response and predictor variable:

`$$\begin{aligned}&H_0: \beta_1 = 0\\ &H_a: \beta_1 \neq 0\\\end{aligned}$$`

- In the previous class, we tested these hypotheses using inference about the slope

- Today we will test the hypotheses using analysis of variance

- This approach will be especially useful for multiple linear regression

---

## Test statistic

- The <font class="vocab3">test statistic</font> for ANOVA is

`$$\text{F-statistic} = \frac{\text{Regression (Model) MS}}{\text{Residual MS}} = \frac{\text{MSS}/1}{\text{RSS}/(n-2)}$$`

<br>

- **Main idea:** If there is a significant linear relationship between `\(x\)` and `\(y\)`, the Regression MS will be much larger than the Residual MS. Therefore, a large F-statistic is evidence against `\(H_0\)`.

---

## p-value

- Under `\(H_0\)`, the test statistic follows an `\(F\)` distribution with `\(1\)` and `\(n-2\)` degrees of freedom

- Large F-statistics are evidence against `\(H_0\)`, so the test is right-tailed, and to calculate the p-value we use

`$$\text{p-value} = P(F \geq \text{F-statistic})$$`

<img src="05-anova_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />
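---

### Computing the F-statistic directly

As a check on the formulas above, here is a minimal sketch (not part of the original slides) that computes the F-statistic and p-value from the fitted `model` for the beer data:

```r
# Sums of squares from the fitted simple linear regression
y <- beer$Carbohydrates
MSS <- sum((fitted(model) - mean(y))^2)  # regression (model) sum of squares
RSS <- sum(residuals(model)^2)           # residual sum of squares
n <- nrow(beer)

# F-statistic: model mean square over residual mean square
f_stat <- (MSS / 1) / (RSS / (n - 2))

# p-value: upper tail of the F distribution with 1 and n - 2 df
pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
```

These values should match the `statistic` and `p.value` columns of the `aov()` output for the beer data shown in a moment.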
---

## ANOVA Table

| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `\(\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` | `\(1\)` | `\(MSS/1\)` | `\(\frac{MSS/1}{RSS/(n-2)}\)` | `\(P(F > \text{F-Stat})\)` |
| Residual | `\(\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2\)` | `\(n-2\)` | `\(RSS/(n-2)\)` | | |
| Total | `\(\sum\limits_{i=1}^{n}(y_i - \bar{y})^2\)` | `\(n-1\)` | `\(TSS/(n-1)\)` | | |

---

### ANOVA for beer data

```r
kable(tidy(aov(model)), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> sumsq </th> <th style="text-align:right;"> meansq </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> PercentAlcohol </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1009.861 </td> <td style="text-align:right;"> 1009.861 </td> <td style="text-align:right;"> 55.397 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 157 </td> <td style="text-align:right;"> 2862.021 </td> <td style="text-align:right;"> 18.229 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr>
</tbody>
</table>

<br>

.question[
- What are the total degrees of freedom?
- What is `\(\hat{\sigma}^2\)`, the estimated regression variance?
- Calculate and interpret `\(R^2\)`.
]

---

class: middle, center

## Using ANOVA to compare group means

---

class: middle

Thus far, we have used a quantitative explanatory variable to understand the variation in a quantitative response variable.

Now, we will use a **categorical (qualitative) explanatory variable** to understand the variation in a quantitative response variable.

---

## Carbohydrates vs. Brewery

- Is there a relationship between the brewery and the total carbohydrates in beers from the top 5 domestic breweries?

```r
top5_brewery <- c("MillerCoors", "Flying Dog Brewery", "Anheuser Busch",
                  "Sierra Nevada", "Budweiser")

beer <- beer %>%
  filter(Brewery %in% top5_brewery)
```

---

## Carbohydrates vs. Brewery

```r
ggplot(data = beer, mapping = aes(x = Brewery, y = Carbohydrates)) +
  geom_boxplot()
```

<img src="05-anova_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" />

---

## Notation

- `\(K\)` is the number of distinct groups. We index the groups as `\(k = 1, \dots, K\)`.

<br>

--

- `\(n_k\)` is the number of observations in group `\(k\)`

<br>

--

- `\(n = n_1 + n_2 + \dots + n_K\)` is the total number of observations

<br>

--

- `\(y_{kj}\)` is the `\(j^{th}\)` observation in group `\(k\)`, for all `\(k, j\)`

<br>

--

- `\(\mu_k\)` is the population mean for group `\(k\)`, for `\(k = 1, \dots, K\)`
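---

### Group sizes and group means in R

To connect this notation to the data, here is a minimal sketch (not part of the original slides) that computes `\(n_k\)` and `\(\bar{y}_k\)` for each of the top 5 breweries:

```r
# Number of observations and mean carbohydrates per brewery
beer %>%
  group_by(Brewery) %>%
  summarise(n_k = n(),
            mean_carb = mean(Carbohydrates))
```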
---

## Set Up

- Is there a significant relationship between the explanatory variable `\(x\)` and the response variable `\(y\)`?

- <font class="vocab">Goal: </font> Test whether the group means are equal

--

- The ANOVA model assumes

`$$y_{kj} = \mu_{k} + \epsilon_{kj}$$`

such that `\(\epsilon_{kj}\)` is the amount `\(y_{kj}\)` deviates from `\(\mu_k\)`

--

- <font class="vocab">Assumption: </font> `\(\epsilon_{kj}\)` follows a Normal distribution with mean 0 and constant variance `\(\sigma^2\)`, i.e. `\(\epsilon_{kj} \sim N(0,\sigma^2)\)`
    + Another viewpoint: `\(y_{kj} \sim N(\mu_k, \sigma^2)\)`

---

## Hypotheses

- <font class="vocab">General Question: </font> Is there a significant difference in the means across the `\(K\)` groups?

--

- In other words, we want to test the following hypotheses:

$$
`\begin{aligned}
&H_0: \mu_1 = \mu_2 = \dots = \mu_K\\
&H_a: \text{At least one }\mu_k \text{ is not equal to the others}
\end{aligned}`
$$

--

- If the sample means are "far apart," there is evidence against `\(H_0\)`

--

- We will calculate a test statistic to assess "far apart" in the context of the data

---

## ANOVA

- **Main Idea:** Decompose the <font color="green">total variation</font> in the data into the variation <font color="blue">between groups</font> and the variation <font color="red">within each group</font>

`$$\color{green}{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}- \bar{y})^2}=\color{blue}{\sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2} + \color{red}{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2}$$`

<br>

--

- If the variation <font color="blue">between groups</font> is significantly greater than the variation <font color="red">within each group</font>, then there is evidence against the null hypothesis.

---

## ANOVA table for comparing means

.small[
| | Sum of Squares | DF | Mean Square | F-Stat | p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Between (Model) | `\(\sum\limits_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2\)` | `\(K-1\)` | `\(SSB/(K-1)\)` | `\(MSB/MSW\)` | `\(P(F > \text{F-Stat})\)` |
| Within (Residual) | `\(\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2\)` | `\(n-K\)` | `\(SSW/(n-K)\)` | | |
| Total | `\(\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{n_k}(y_{kj}-\bar{y})^2\)` | `\(n-1\)` | `\(SST/(n-1)\)` | | |
]

---

## Total

- Variation between and within groups

`$$SST =\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y})^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFT = n-1$$`

--

- Estimate of the variance across all observations:

`$$\frac{SST}{DFT} = \frac{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y})^2}{n-1}$$`

---

## Between (Model)

- Variation in the group means

`$$SSB = \sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFB = K-1$$`

--

- <font class="vocab">Mean Squares Between</font>

`$$MSB = \frac{SSB}{DFB} = \frac{\sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2}{K-1}$$`

--

- MSB estimates the variance across the `\(\mu_k\)`'s

---

class: regular

## Within (Residual)

- Variation within each group

`$$SSW=\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2$$`

--

- <font class="vocab">Degrees of freedom</font>

`$$DFW = n-K$$`

--

- <font class="vocab">Mean Squares Within</font>

`$$MSW = \frac{SSW}{DFW} = \frac{\sum_{k=1}^{K}\sum_{j=1}^{n_k}(y_{kj}-\bar{y}_k)^2}{n-K}$$`

--

- MSW is the estimate of `\(\sigma^2\)`, the variance within each group
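---

### Computing SSB and SSW directly

Before looking at the `aov()` output, here is a minimal sketch (not part of the original slides) that builds the decomposition by hand for the filtered `beer` data:

```r
y_bar <- mean(beer$Carbohydrates)  # overall mean

group_stats <- beer %>%
  group_by(Brewery) %>%
  summarise(n_k = n(), y_bar_k = mean(Carbohydrates))

# Between-group sum of squares: group means about the overall mean
SSB <- sum(group_stats$n_k * (group_stats$y_bar_k - y_bar)^2)

# Within-group sum of squares: observations about their group means
SSW <- beer %>%
  group_by(Brewery) %>%
  mutate(dev_sq = (Carbohydrates - mean(Carbohydrates))^2) %>%
  pull(dev_sq) %>%
  sum()

# F-statistic with K - 1 and n - K degrees of freedom
K <- nrow(group_stats)
n <- nrow(beer)
f_stat <- (SSB / (K - 1)) / (SSW / (n - K))
```

These values should match the `sumsq`, `meansq`, and `statistic` columns in the ANOVA table on the next slide.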
---

## Carbohydrates vs. Brewery

```r
model_brewery <- lm(Carbohydrates ~ Brewery, data = beer)
kable(tidy(aov(model_brewery)), format = "html", digits = 3)
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> sumsq </th> <th style="text-align:right;"> meansq </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Brewery </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 962.055 </td> <td style="text-align:right;"> 240.514 </td> <td style="text-align:right;"> 14.02 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> 1509.696 </td> <td style="text-align:right;"> 17.156 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr>
</tbody>
</table>

.question[
- How many observations in the dataset are from the top 5 breweries?
- What is `\(\hat{\sigma}^2\)`, the estimated variance within each group?
- State the null and alternative hypotheses for this test. What is your conclusion?
]

---

## Assumptions for ANOVA

- <font class="vocab">Normality: </font> `\(y_{kj} \sim N(\mu_k, \sigma^2)\)`

- <font class="vocab">Independence: </font> There is independence within and across groups

- <font class="vocab">Constant Variance: </font> The population distribution for each group has a common variance, `\(\sigma^2\)`

- We can typically check these assumptions in the exploratory data analysis

---

## Before next class

- [Reading 03](https://www2.stat.duke.edu/courses/Spring19/sta210.001/reading/reading-03.html) due Monday

- Understand inference and ANOVA for simple linear regression!
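---

## Appendix: checking assumptions in R

A minimal sketch (not part of the original slides) of exploratory checks for the normality and constant-variance assumptions, using the filtered `beer` data:

```r
# Normality: look at the distribution of the response within each group
# (bin count chosen arbitrarily for illustration)
ggplot(data = beer, mapping = aes(x = Carbohydrates)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ Brewery)

# Constant variance: compare the spread of each group's distribution
beer %>%
  group_by(Brewery) %>%
  summarise(sd_carb = sd(Carbohydrates))
```

If the group standard deviations are roughly similar and each histogram is approximately symmetric, the assumptions are plausible.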