class: center, middle, inverse, title-slide # Multinomial Logistic Regression ## Predictions & Drop-in Deviance Test ### Dr. Maria Tackett ### 11.04.19 --- class: middle, center ### [Click for PDF of slides](20-multinomial-logistic-pt2.pdf) --- ### Announcements - Multinomial Logistic Regression: [Reading 10](https://www2.stat.duke.edu/courses/Fall19/sta210.001/reading/reading-10.html) and [Reading 11](https://www2.stat.duke.edu/courses/Fall19/sta210.001/reading/reading-11.html) - HW 05 **due TODAY at 11:59p** --- ### Generalized Linear Models (GLM) - In practice, there are many different types of response variables including: + **Binary**: Win or Lose + **Nominal**: Democrat, Republican or Third Party candidate + **Ordered**: Movie rating (1 - 5 stars) + and others... - These are all examples of **generalized linear models**, a broader class of models that generalize the multiple linear regression model - See [*Generalized Linear Models: A Unifying Theory*](https://bookdown.org/roback/bookdown-bysh/ch-glms.html) for more details about GLMs --- ### Binary Response (Logistic) - Suppose we consider `\(y=0\)` the *<font color="blue">baseline category</font>* such that `$$P(y_i=0|x_i) = p_{i0} \text{ and } P(y_i=1|x_i) = p_{i1}$$` -- - Then the logit model is `$$\log\bigg(\frac{p_{i1}}{p_{i0}}\bigg) = \beta_0 + \beta_1 x_i$$` -- - <font class="vocab">Slope, `\(\beta_1\)`</font>: When `\(x\)` increases by one unit, the odds of `\(Y=1\)` versus the baseline `\(Y=0\)` are expected to multiply by a factor of `\(\exp\{\beta_1\}\)` - <font class="vocab3">Intercept, `\(\beta_0\)`</font>: When `\(x=0\)`, the odds of `\(y=1\)` versus the baseline `\(y=0\)` are expected to be `\(\exp\{\beta_0\}\)` --- ### Multinomial response variable - Suppose the response variable `\(y\)` is categorical and can take values `\(1, 2, \ldots, k\)` such that `\((k > 2)\)` - <font class="vocab">Multinomial Distribution: </font> `$$P(Y=1) = p_1, P(Y=2) = p_2, \ldots, P(Y=k) = p_k$$` such that `\(\sum\limits_{j=1}^{k} p_j = 1\)` --- ### Multinomial Logistic Regression - Suppose we have a response variable `\(Y\)` that can take three possible outcomes that are coded as "1", "2", "3" - Let "1" be the baseline category. Then .alert[ `$$\log\bigg(\frac{p_{i2}}{p_{i1}}\bigg) = \beta_{02} + \beta_{12} X_i \\[15pt] \log\bigg(\frac{p_{i3}}{p_{i1}}\bigg) = \beta_{03} + \beta_{13} X_i$$` ] --- ### Multinomial Regression in R - Use the <font class="vocab">`multinom()`</font> function in the `nnet` package ```r library(nnet) my.model <- multinom(Y ~ X1 + X2 + ... + XP, data=my.data) tidy(my.model, exponentiate = FALSE) #display log-odds model ``` <br> ```r # calculate predicted probabilities pred.probs <- predict(my.model, type = "probs") ``` --- ### NHANES Data - [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) is conducted by the National Center for Health Statistics (NCHS) - The goal is to *"assess the health and nutritional status of adults and children in the United States"* - This survey includes an interview and a physical examination --- ### NHANES Data - We will use the data from the <font class="vocab">`NHANES`</font> R package - Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years - The data in this package is modified for educational purposes and should **not** be used for research - Original data can be obtained from the [NCHS website](https://www.cdc.gov/nchs/data_access/index.htm) for research purposes - Type <font class="vocab">`?NHANES`</font> in console to see list of variables and definitions --- ### NHANES: Health Rating vs. Age & Physical Activity - **Question**: Can we use a person's age and whether they do regular physical activity to predict their self-reported health rating? - We will analyze the following variables: + <font class="vocab">`HealthGen`: </font>Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor. + <font class="vocab">`Age`: </font>Age at time of screening (in years). Participants 80 or older were recorded as 80. + <font class="vocab">`PhysActive`: </font>Participant does moderate to vigorous-intensity sports, fitness or recreational activities --- ### The data ```r library(NHANES) nhanes_adult <- NHANES %>% filter(Age >= 18) %>% select(HealthGen, Age, PhysActive, Education) %>% drop_na() %>% mutate(obs_num = 1:n()) ``` ```r glimpse(nhanes_adult) ``` ``` ## Observations: 6,465 ## Variables: 5 ## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, V… ## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 5… ## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No… ## $ Education <fct> High School, High School, High School, Some College, … ## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16… ``` --- ### Exploratory data analysis <img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ### Exploratory data analysis <img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ### `HealthGen` vs. `Age` and `PhysActive` ```r library(nnet) health_m <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) ``` - Put `results = "hide"` in the code chunk header to suppress convergence output --- ### `HealthGen` vs. `Age` and `PhysActive` ```r tidy(health_m, exponentiate = FALSE, conf.int = TRUE) %>% kable(digits = 3, format = "markdown") ``` |y.level |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------|:-------------|--------:|---------:|---------:|-------:|--------:|---------:| |Vgood |(Intercept) | 1.265| 0.154| 8.235| 0.000| 0.964| 1.567| |Vgood |Age | 0.000| 0.003| -0.014| 0.989| -0.005| 0.005| |Vgood |PhysActiveYes | -0.332| 0.095| -3.496| 0.000| -0.518| -0.146| |Good |(Intercept) | 1.989| 0.150| 13.285| 0.000| 1.695| 2.282| |Good |Age | -0.003| 0.003| -1.187| 0.235| -0.008| 0.002| |Good |PhysActiveYes | -1.011| 0.092| -10.979| 0.000| -1.192| -0.831| |Fair |(Intercept) | 1.033| 0.174| 5.938| 0.000| 0.692| 1.374| |Fair |Age | 0.001| 0.003| 0.373| 0.709| -0.005| 0.007| |Fair |PhysActiveYes | -1.662| 0.109| -15.190| 0.000| -1.877| -1.448| |Poor |(Intercept) | -1.338| 0.299| -4.475| 0.000| -1.924| -0.752| |Poor |Age | 0.019| 0.005| 3.827| 0.000| 0.009| 0.029| |Poor |PhysActiveYes | -2.670| 0.236| -11.308| 0.000| -3.133| -2.208| --- ### Interpreting coefficients .question[ 1. What is the model baseline category, i.e. the baseline category of the response variable? 2. Write the model for the odds that a person rates themselves as having "Fair" health versus the model baseline category. 3. Interpret the coefficient for `Age` in terms of the odds that a person rates themselves has having "Poor" heath versus the model's baseline category ] --- ### Model assessment For each category of the response, `\(j\)`: - Analyze a plot of the binned residuals vs. predicted probabilities - Analyze a plot of the binned residuals vs. each continuous predictor variable - Look for any patterns in the residuals plots - For each categorical predictor variable, examine the average residuals for each category of the response variable --- ### NHANES: Predicted probabilities ```r #calculate predicted probabilities pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% mutate(obs_num = 1:n()) ``` ```r pred_probs %>% slice(1:10) ``` ``` ## # A tibble: 10 x 6 ## Excellent Vgood Good Fair Poor obs_num ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 0.0687 0.243 0.453 0.201 0.0348 1 ## 2 0.0687 0.243 0.453 0.201 0.0348 2 ## 3 0.0687 0.243 0.453 0.201 0.0348 3 ## 4 0.0691 0.244 0.435 0.205 0.0467 4 ## 5 0.155 0.393 0.359 0.0868 0.00671 5 ## 6 0.155 0.393 0.359 0.0868 0.00671 6 ## 7 0.155 0.393 0.359 0.0868 0.00671 7 ## 8 0.157 0.400 0.342 0.0904 0.0102 8 ## 9 0.156 0.397 0.349 0.0890 0.00872 9 ## 10 0.156 0.396 0.352 0.0883 0.00804 10 ``` --- ### NHANES: Residuals ```r #calculate residuals residuals <- as_tibble(residuals(health_m)) %>% #calculate residuals setNames(paste('resid.', names(.), sep = "")) %>% #update column names mutate(obs_num = 1:n()) #add obs number ``` ```r residuals %>% slice(1:10) ``` ``` ## # A tibble: 10 x 6 ## resid.Excellent resid.Vgood resid.Good resid.Fair resid.Poor obs_num ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 -0.0687 -0.243 0.547 -0.201 -0.0348 1 ## 2 -0.0687 -0.243 0.547 -0.201 -0.0348 2 ## 3 -0.0687 -0.243 0.547 -0.201 -0.0348 3 ## 4 -0.0691 -0.244 0.565 -0.205 -0.0467 4 ## 5 -0.155 0.607 -0.359 -0.0868 -0.00671 5 ## 6 -0.155 0.607 -0.359 -0.0868 -0.00671 6 ## 7 -0.155 0.607 -0.359 -0.0868 -0.00671 7 ## 8 -0.157 0.600 -0.342 -0.0904 -0.0102 8 ## 9 -0.156 0.603 -0.349 -0.0890 -0.00872 9 ## 10 -0.156 -0.396 -0.352 0.912 -0.00804 10 ``` --- ### Make "augmented" dataset ```r health_m_aug <- inner_join(nhanes_adult, pred_probs) #add probs health_m_aug <- inner_join(health_m_aug, residuals) #add resid ``` ```r health_m_aug %>% glimpse() ``` ``` ## Observations: 6,465 ## Variables: 15 ## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgo… ## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, … ## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Ye… ## $ Education <fct> High School, High School, High School, Some Coll… ## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1… ## $ Excellent <dbl> 0.06870508, 0.06870508, 0.06870508, 0.06906126, … ## $ Vgood <dbl> 0.2432327, 0.2432327, 0.2432327, 0.2443614, 0.39… ## $ Good <dbl> 0.4527247, 0.4527247, 0.4527247, 0.4348186, 0.35… ## $ Fair <dbl> 0.20055763, 0.20055763, 0.20055763, 0.20503866, … ## $ Poor <dbl> 0.034779881, 0.034779881, 0.034779881, 0.0467200… ## $ resid.Excellent <dbl> -0.06870508, -0.06870508, -0.06870508, -0.069061… ## $ resid.Vgood <dbl> -0.2432327, -0.2432327, -0.2432327, -0.2443614, … ## $ resid.Good <dbl> 0.5472753, 0.5472753, 0.5472753, 0.5651814, -0.3… ## $ resid.Fair <dbl> -0.20055763, -0.20055763, -0.20055763, -0.205038… ## $ resid.Poor <dbl> -0.034779881, -0.034779881, -0.034779881, -0.046… ``` --- ### Binned residuals vs. pred. probabilities <img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /><img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-16-2.png" style="display: block; margin: auto;" /> --- ### Binned residuals vs. `Age` <img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /><img src="20-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-17-2.png" style="display: block; margin: auto;" /> --- ### Residuals vs. `PhysActive` ```r health_m_aug %>% group_by(PhysActive) %>% summarise(mean.Excellent = mean(resid.Excellent), mean.Vgood = mean(resid.Vgood), mean.Good = mean(resid.Good), mean.Fair = mean(resid.Fair), mean.Poor = mean(resid.Poor)) %>% t() ``` ``` ## [,1] [,2] ## PhysActive "No" "Yes" ## mean.Excellent "-2.683227e-07" " 1.732639e-06" ## mean.Vgood " 4.866088e-07" "-1.193899e-06" ## mean.Good " 7.868508e-07" "-1.241316e-06" ## mean.Fair "-1.081921e-06" " 6.788893e-07" ## mean.Poor "7.678444e-08" "2.368658e-08" ``` --- ## Calculating probabilities For `\(j = 2,\ldots, k\)`, we calculate the probability `\(p_{ij}\)` as .alert[ `$$p_{ij} = \frac{\exp\{\beta_{0j} + \beta_{1j}x_i\}}{1 + \sum\limits_{j=2}^k \exp\{\beta_{0j} + \beta_{1j}x_i\}}$$` ] -- For the baseline category `\((j = 1)\)` we calculate the probability `\((p_{i1})\)` as .alert[ `$$p_{i1} = 1- \sum\limits_{j=2}^k p_{ij}$$` ] -- We will use these probabilities to assign a category of the response for each observation --- ### Actual vs. Predicted Health Rating - We can use our model to predict a person's health rating given their age and whether they exercise - For each observation, the predicted health rating is the one with the highest predicted probability ```r health_m_aug <- health_m_aug %>% mutate(pred_health = predict(health_m, type = "class")) ``` --- ### Actual vs. Predicted Health Rating ```r health_m_aug %>% count(HealthGen, pred_health, .drop = FALSE) %>% pivot_wider(names_from = pred_health, values_from = n) ``` ``` ## # A tibble: 5 x 6 ## HealthGen Excellent Vgood Good Fair Poor ## <fct> <int> <int> <int> <int> <int> ## 1 Excellent 0 528 210 0 0 ## 2 Vgood 0 1341 743 0 0 ## 3 Good 0 1226 1316 0 0 ## 4 Fair 0 296 625 0 0 ## 5 Poor 0 24 156 0 0 ``` ```r #rows = actual, columns = predicted ``` --- ## Predictions ``` ## # A tibble: 5 x 6 ## Excellent Vgood Good Fair Poor pred_health ## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> ## 1 0.0687 0.243 0.453 0.201 0.0348 Good ## 2 0.0687 0.243 0.453 0.201 0.0348 Good ## 3 0.0687 0.243 0.453 0.201 0.0348 Good ## 4 0.0691 0.244 0.435 0.205 0.0467 Good ## 5 0.155 0.393 0.359 0.0868 0.00671 Vgood ``` --- ### Drop-in-deviance Test - Suppose there are two models: - Model 1 includes predictors `\(x_1, \ldots, x_q\)` - Model 2 includes predictors `\(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)` - We want to test the hypotheses `$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$` - Use the **drop-in-deviance test** to compare models (similar to logistic regression) --- ### Add `Education` to the model? - We consider adding the participants' `Education` level to the model. - Education takes values `8thGrade`, `9-11thGrade`, `HighSchool`, `SomeCollege`, and `CollegeGrad` - Models we're testing: - Model 1: `Age`, `PhysActive` - Model 2: `Age`, `PhysActive`, `Education` .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad}\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] --- ### Add `Education` to the model? .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad}\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] ```r m1 <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) m2 <- multinom(HealthGen ~ Age + PhysActive + Education, data = nhanes_adult) ``` --- ### Add `Education` to the model? ```r m1 <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) m2 <- multinom(HealthGen ~ Age + PhysActive + Education, data = nhanes_adult) ``` ```r kable(anova(m1, m2, test = "Chisq"), format = "markdown") ``` |Model | Resid. df| Resid. Dev|Test | Df| LR stat.| Pr(Chi)| |:----------------------------|---------:|----------:|:------|-----:|--------:|-------:| |Age + PhysActive | 25848| 16994.23| | NA| NA| NA| |Age + PhysActive + Education | 25832| 16505.10|1 vs 2 | 16| 489.1319| 0| -- At least one coefficient associated with `Education` is non-zero. Therefore, `Education` is a statistically significant predictor for `HealthGen`.