class: center, middle, inverse, title-slide

# Model Assessment
### Dr. Maria Tackett
### 02.13.19

---

## Announcements

- Lab 04 due today

- HW 03 due Monday, Feb 18

---

## R packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(cowplot) # use plot_grid function
```

---

## Restaurant tips

What affects the amount customers tip at a restaurant?

- **Response:**
    - <font class="vocab">`Tip`</font>: amount of the tip

- **Predictors:**
    - <font class="vocab">`Party`</font>: number of people in the party
    - <font class="vocab">`Meal`</font>: time of day (Lunch, Dinner, Late Night)
    - <font class="vocab">`Age`</font>: age category of person paying the bill (Yadult, Middle, SenCit)

```r
tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party))
```

---

## ANOVA table for regression

We can use the Analysis of Variance (ANOVA) table to decompose the variability in our response variable

| | Sum of Squares | DF | Mean Square | F-Stat| p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | `$$\sum\limits_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$` | `$$p$$` | `$$\frac{MSS}{p}$$` | `$$\frac{MMS}{RMS}$$` | `$$P(F > \text{F-Stat})$$` |
| Residual | `$$\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$` | `$$n-p-1$$` | `$$\frac{RSS}{n-p-1}$$` | | |
| Total | `$$\sum\limits_{i=1}^{n}(y_i - \bar{y})^2$$` | `$$n-1$$` | `$$\frac{TSS}{n-1}$$` | | |

The estimate of the regression variance, `\(\hat{\sigma}^2 = RMS\)`

---

## `\(R^2\)`

- **Recall**: `\(R^2\)` is the proportion of the variation in the response variable explained by the regression model

<br>

--

- `\(R^2\)` will always increase as we add more variables to the model
    + If we add enough variables, we can always achieve `\(R^2=100\%\)`

<br>

--

- If we only use `\(R^2\)` to choose a best fit model, we will be prone to choose the model with the most predictor variables

---

## Adjusted `\(R^2\)`

- <font class="vocab">Adjusted `\(R^2\)`</font>: a
version of `\(R^2\)` that penalizes for unnecessary predictor variables <br> - Similar to `\(R^2\)`, it measures the proportion of variation in the response that is explained by the regression model <br> - Differs from `\(R^2\)` by using the mean squares rather than sums of squares and therefore adjusting for the number of predictor variables --- ## `\(R^2\)` and Adjusted `\(R^2\)` `$$R^2 = \frac{\text{Total Sum of Squares} - \text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$` <br> -- .alert[ `$$Adj. R^2 = \frac{\text{Total Mean Square} - \text{Residual Mean Square}}{\text{Total Mean Square}}$$` ] <br> -- - `\(Adj. R^2\)` can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment! -- - Use `\(R^2\)` when describing the relationship between the response and predictor variables --- ### Restaurant tips: model ```r model1 <- lm(Tip ~ Party + Meal + Age , data = tips) kable(tidy(model1),format="html",digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.254 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 3.182 </td> <td style="text-align:right;"> 0.002 </td> </tr> <tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1.808 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 14.909 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> MealLate Night </td> <td style="text-align:right;"> -1.632 </td> <td style="text-align:right;"> 0.407 </td> <td style="text-align:right;"> -4.013 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> 
MealLunch </td> <td style="text-align:right;"> -0.612 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> -1.523 </td> <td style="text-align:right;"> 0.130 </td> </tr> <tr> <td style="text-align:left;"> AgeSenCit </td> <td style="text-align:right;"> 0.390 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 0.990 </td> <td style="text-align:right;"> 0.324 </td> </tr> <tr> <td style="text-align:left;"> AgeYadult </td> <td style="text-align:right;"> -0.505 </td> <td style="text-align:right;"> 0.412 </td> <td style="text-align:right;"> -1.227 </td> <td style="text-align:right;"> 0.222 </td> </tr> </tbody> </table> --- ## Restaurant tips: ANOVA - <font class="vocab">R output</font> <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F value </th> <th style="text-align:right;"> Pr(>F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1188.636 </td> <td style="text-align:right;"> 1188.636 </td> <td style="text-align:right;"> 311.002 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Meal </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 88.460 </td> <td style="text-align:right;"> 44.230 </td> <td style="text-align:right;"> 11.573 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 13.032 </td> <td style="text-align:right;"> 6.516 </td> <td style="text-align:right;"> 1.705 </td> <td style="text-align:right;"> 0.185 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 622.979 </td> <td 
style="text-align:right;"> 3.822 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>

--

- <font class="vocab">ANOVA table</font>

| | Sum of Squares | DF | Mean Square | F-Stat| p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | 1290.12829 | 5 | 258.025658 | 67.5113618 | 0 |
| Residual | 622.97932 | 163 | 3.821959 | | |
| Total | 1913.10761 | 168 | | | |

---

### Calculating `\(R^2\)` and Adj `\(R^2\)`

| | Sum of Squares | DF | Mean Square | F-Stat| p-value |
|------------------|----------------|--------------------|-------------|-------------|--------------------|
| Regression (Model) | 1290.12829 | 5 | 258.025658 | 67.5113618 | 0 |
| Residual | 622.97932 | 163 | 3.821959 | | |
| Total | 1913.10761 | 168 | | | |

```r
# r-squared: total SS is the sum of the Sum Sq column in the R output
tss <- 1188.63588 + 88.46005 + 13.03236 + 622.97932
rss <- 622.97932
(r_sq <- (tss - rss)/tss)
```

```
## [1] 0.6743626
```

--

```r
# adj r-squared: same ratio, using mean squares instead of sums of squares
tms <- tss/(nrow(tips)-1)
rms <- 3.821959
(adj_r_sq <- (tms - rms)/tms)
```

```
## [1] 0.6643738
```

---

### Restaurant tips: `\(R^2\)` and Adj. `\(R^2\)`

```r
glance(model1)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.674         0.664  1.95      67.5 6.14e-38     6  -350.  714.  736.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```

<br>

- `\(R^2\)` and Adjusted `\(R^2\)` are close (0.674 vs. 0.664), which suggests the model is not carrying many predictors that fail to help explain the variation in tips; on its own, this does not show that each variable is significant

---

## ANOVA F Test

- Using the ANOVA table, we can test whether any variable in the model is a significant predictor of the response.
We conduct this test using the following hypotheses: .alert[ `$$\begin{aligned}&H_0: \beta_{1} = \beta_{2} = \dots = \beta_p = 0 \\ &H_a: \text{at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$` ] <br> - The statistic for this test is the `\(F\)` test statistic in the ANOVA table - We calculate the p-value using an `\(F\)` distribution with `\(p\)` and `\((n-p-1)\)` degrees of freedom --- ## ANOVA F Test in R ```r model0 <- lm(Tip ~ 1, data=tips) ``` -- ```r model1 <- lm(Tip ~ Party + Meal + Age , data = tips) ``` -- ```r kable(anova(model0,model1),format="html") ``` <table> <thead> <tr> <th style="text-align:right;"> Res.Df </th> <th style="text-align:right;"> RSS </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum of Sq </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> Pr(>F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 168 </td> <td style="text-align:right;"> 1913.1076 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 622.9793 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1290.128 </td> <td style="text-align:right;"> 67.51136 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> **At least one coefficient is non-zero, i.e. 
at least one predictor in the model is significant**

---

### Testing subset of coefficients

- Sometimes we want to test whether a subset of coefficients are all equal to 0

- This is often the case when we want to test
    - whether a categorical variable with `\(k\)` levels is a significant predictor of the response
    - whether the interaction between a categorical and quantitative variable is significant

- To do so, we will use the <font class="vocab3">Nested F Test </font>

---

## Nested F Test

- Suppose we have a full and reduced model:

`$$\begin{aligned}&\text{Full}: y = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q + \beta_{q+1} x_{q+1} + \dots + \beta_p x_p \\ &\text{Red}: y = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q\end{aligned}$$`

<br>

- We want to test whether any of the variables `\(x_{q+1}, x_{q+2}, \ldots, x_p\)` are significant predictors. To do so, we will test the hypothesis:

.alert[
`$$\begin{aligned}&H_0: \beta_{q+1} = \beta_{q+2} = \dots = \beta_p = 0 \\ &H_a: \text{at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$`
]

---

## Nested F Test

- The test statistic for this test is

`$$F = \frac{(RSS_{reduced} - RSS_{full})\big/(p_{full} - p_{reduced})}{RSS_{full}\big/(n-p_{full}-1)}$$`

<br>

- Calculate the p-value using the F distribution with `\((p_{full} - p_{reduced})\)` and `\((n-p_{full}-1)\)` degrees of freedom

---

### Is `Meal` a significant predictor of tips?
<table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.254 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 3.182 </td> <td style="text-align:right;"> 0.002 </td> </tr> <tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1.808 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 14.909 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> AgeSenCit </td> <td style="text-align:right;"> 0.390 </td> <td style="text-align:right;"> 0.394 </td> <td style="text-align:right;"> 0.990 </td> <td style="text-align:right;"> 0.324 </td> </tr> <tr> <td style="text-align:left;"> AgeYadult </td> <td style="text-align:right;"> -0.505 </td> <td style="text-align:right;"> 0.412 </td> <td style="text-align:right;"> -1.227 </td> <td style="text-align:right;"> 0.222 </td> </tr> <tr> <td style="text-align:left;"> MealLate Night </td> <td style="text-align:right;"> -1.632 </td> <td style="text-align:right;"> 0.407 </td> <td style="text-align:right;"> -4.013 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> MealLunch </td> <td style="text-align:right;"> -0.612 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> -1.523 </td> <td style="text-align:right;"> 0.130 </td> </tr> </tbody> </table> --- ### Tips data: Nested F Test `$$\begin{aligned}&H_0: \beta_{late night} = \beta_{lunch} = 0\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to 0}\end{aligned}$$` -- ```r reduced <- lm(Tip ~ Party + Age, data = tips) ``` -- ```r full <- lm(Tip ~ Party + Age + Meal, data = tips) ``` -- ```r 
kable(anova(full,reduced),format="html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Res.Df </th>
   <th style="text-align:right;"> RSS </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Sum of Sq </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> Pr(>F) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 163 </td>
   <td style="text-align:right;"> 622.9793 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 165 </td>
   <td style="text-align:right;"> 686.4439 </td>
   <td style="text-align:right;"> -2 </td>
   <td style="text-align:right;"> -63.46457 </td>
   <td style="text-align:right;"> 8.302623 </td>
   <td style="text-align:right;"> 0.0003684 </td>
  </tr>
</tbody>
</table>

--

**At least one coefficient associated with `Meal` is not zero. Therefore, `Meal` is a significant predictor of `Tip`.**

---

class: middle

.question[
Why is it not good practice to use the individual p-values to determine whether a categorical variable (with `\(k > 2\)` levels) is significant?

*Hint*: What does it actually mean if none of the `\(k-1\)` p-values are significant?
] --- ## Practice with Interactions <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.2764989 </td> <td style="text-align:right;"> 0.4910882 </td> <td style="text-align:right;"> 2.5993270 </td> <td style="text-align:right;"> 0.0102086 </td> </tr> <tr> <td style="text-align:left;"> Party </td> <td style="text-align:right;"> 1.7947980 </td> <td style="text-align:right;"> 0.1715003 </td> <td style="text-align:right;"> 10.4652753 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> AgeSenCit </td> <td style="text-align:right;"> 0.4007889 </td> <td style="text-align:right;"> 0.3969295 </td> <td style="text-align:right;"> 1.0097230 </td> <td style="text-align:right;"> 0.3141431 </td> </tr> <tr> <td style="text-align:left;"> AgeYadult </td> <td style="text-align:right;"> -0.4701634 </td> <td style="text-align:right;"> 0.4197146 </td> <td style="text-align:right;"> -1.1201978 </td> <td style="text-align:right;"> 0.2642977 </td> </tr> <tr> <td style="text-align:left;"> MealLate Night </td> <td style="text-align:right;"> -1.8454674 </td> <td style="text-align:right;"> 0.7089728 </td> <td style="text-align:right;"> -2.6030159 </td> <td style="text-align:right;"> 0.0101039 </td> </tr> <tr> <td style="text-align:left;"> MealLunch </td> <td style="text-align:right;"> -0.4608832 </td> <td style="text-align:right;"> 0.8651044 </td> <td style="text-align:right;"> -0.5327487 </td> <td style="text-align:right;"> 0.5949421 </td> </tr> <tr> <td style="text-align:left;"> Party:MealLate Night </td> <td style="text-align:right;"> 0.1108600 </td> <td style="text-align:right;"> 0.2846584 </td> <td style="text-align:right;"> 0.3894491 </td> <td 
style="text-align:right;"> 0.6974586 </td> </tr> <tr> <td style="text-align:left;"> Party:MealLunch </td> <td style="text-align:right;"> -0.0500822 </td> <td style="text-align:right;"> 0.2825586 </td> <td style="text-align:right;"> -0.1772455 </td> <td style="text-align:right;"> 0.8595384 </td> </tr> </tbody> </table> .question[ 1. Write the general form of the model. 2. Write the model for `Meal == "Late Night"`. 3. How does the mean change when `Meal == "Late Night"`? 4. How does the slope of `Party` change when `Meal == "Late Night"`? ] --- ### Nested F test for interactions **Is the interaction between `Party` and `Meal` significant?** ```r reduced <- lm(Tip ~ Party + Age + Meal, data = tips) ``` -- ```r full <- lm(Tip ~ Party + Age + Meal + Meal*Party, data = tips) ``` -- ```r kable(anova(full,reduced),format="html") ``` <table> <thead> <tr> <th style="text-align:right;"> Res.Df </th> <th style="text-align:right;"> RSS </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum of Sq </th> <th style="text-align:right;"> F </th> <th style="text-align:right;"> Pr(>F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 161 </td> <td style="text-align:right;"> 621.9651 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 622.9793 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -1.014261 </td> <td style="text-align:right;"> 0.1312743 </td> <td style="text-align:right;"> 0.877071 </td> </tr> </tbody> </table>
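---

### Nested F test by hand

As a check on the `anova()` output above, the nested F statistic can be reproduced directly from the two residual sums of squares. The following is a minimal sketch in base R: the RSS values and residual degrees of freedom are copied (as printed, so rounded) from the table above, and the variable names are chosen here for illustration only.

```r
# Values copied from the anova(full, reduced) table above (rounded as printed)
rss_full    <- 621.9651   # RSS of the model with the Party:Meal interaction
rss_reduced <- 622.9793   # RSS of the model without the interaction
df_full     <- 161        # residual df of the full model, n - p_full - 1
df_reduced  <- 163        # residual df of the reduced model

# Number of coefficients being tested, p_full - p_reduced
df_diff <- df_reduced - df_full

# F = [(RSS_reduced - RSS_full) / (p_full - p_reduced)] / [RSS_full / (n - p_full - 1)]
f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail probability from the F distribution with (df_diff, df_full) df
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

round(c(F = f_stat, p.value = p_value), 3)
```

Up to rounding of the inputs, this should agree with the F of about 0.131 and the p-value of about 0.877 in the table, so the data do not provide evidence that the interaction between `Party` and `Meal` is needed.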