The goal of this exercise is to walk through a multinomial logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.

library(tidyverse)
library(knitr)
library(broom)
library(patchwork)
library(nnet)

This data is from the NHANES dataset found in the NHANES R package. Type ?NHANES in the console to read more about the data and variables. We’ll focus on the following variables for this analysis:

Load and prepare data

library(NHANES)

nhanes_adult <- NHANES %>%
  filter(Age >= 18) %>%
  select(HealthGen, Age, PhysActive) %>%
  drop_na() %>%
  mutate(obs_num = 1:n())
glimpse(nhanes_adult)
## Rows: 6,710
## Columns: 4
## $ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…

Exploratory data analysis

Univariate

p1 <- ggplot(data = nhanes_adult, aes(x = Age)) + 
  geom_histogram() +
  labs(title = "Distribution of Age")

p2 <- ggplot(data = nhanes_adult, aes(x = PhysActive)) + 
  geom_bar() +
  labs(title = "Moderate or vigorous sport or exercise")

p3 <- ggplot(data = nhanes_adult, aes(x = HealthGen)) + 
  geom_bar() +
  labs(title = "Self-reported rating of overall health")

p3 + (p1 / p2)

Bivariate

p4 <- ggplot(data = nhanes_adult, aes(x = HealthGen, y = Age)) +
  geom_boxplot(fill = "steelblue") + 
  labs(title = "Age vs. Health Rating")

p5 <- ggplot(data = nhanes_adult, aes(x = HealthGen, fill = PhysActive)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", 
       title = "Physical Activity vs. Health Rating") +
  coord_flip()
  
p4 + p5

Fit multinomial logistic model

health_m <- multinom(HealthGen ~ Age + PhysActive, 
                     data = nhanes_adult)
tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>%
  kable(digits = 3, format = "markdown")
y.level term estimate std.error statistic p.value conf.low conf.high
Vgood (Intercept) 1.205 0.145 8.325 0.000 0.922 1.489
Vgood Age 0.001 0.002 0.369 0.712 -0.004 0.006
Vgood PhysActiveYes -0.321 0.093 -3.454 0.001 -0.503 -0.139
Good (Intercept) 1.948 0.141 13.844 0.000 1.672 2.223
Good Age -0.002 0.002 -0.977 0.329 -0.007 0.002
Good PhysActiveYes -1.001 0.090 -11.120 0.000 -1.178 -0.825
Fair (Intercept) 0.915 0.164 5.566 0.000 0.592 1.237
Fair Age 0.003 0.003 1.058 0.290 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.000 -1.856 -1.435
Poor (Intercept) -1.521 0.290 -5.238 0.000 -2.090 -0.952
Poor Age 0.022 0.005 4.522 0.000 0.013 0.032
Poor PhysActiveYes -2.656 0.236 -11.275 0.000 -3.117 -2.194

What is the baseline category for the model?

Write the model for the odds that a person rates themselves as having “Fair” health versus the baseline category.

Interpret the coefficient of Age and its 95% confidence interval in terms of the odds that a person rates themselves as having “Poor” health versus baseline category.

Interpret the coefficient of PhysActiveYes and its 95% confidence interval in terms of the odds that a person rates themselves as having “Very Good” health versus baseline category.

Predictions

Predictions: Probablities

#calculate predicted probabilities
pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% 
                        mutate(obs_num = 1:n()) 
pred_probs %>%
  slice(101:105)
## # A tibble: 5 x 6
##   Excellent Vgood  Good   Fair    Poor obs_num
##       <dbl> <dbl> <dbl>  <dbl>   <dbl>   <int>
## 1    0.0705 0.244 0.451 0.198  0.0366      101
## 2    0.0702 0.244 0.441 0.202  0.0426      102
## 3    0.0696 0.244 0.427 0.206  0.0527      103
## 4    0.0696 0.244 0.427 0.206  0.0527      104
## 5    0.155  0.393 0.359 0.0861 0.00662     105
health_m_aug <- inner_join(nhanes_adult, pred_probs, 
                           by = "obs_num") %>%
  select(obs_num, everything())
health_m_aug %>% 
  glimpse()
## Rows: 6,710
## Columns: 9
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ Excellent  <dbl> 0.07069715, 0.07069715, 0.07069715, 0.07003173, 0.15547075…
## $ Vgood      <dbl> 0.2433979, 0.2433979, 0.2433979, 0.2444214, 0.3922335, 0.3…
## $ Good       <dbl> 0.4573727, 0.4573727, 0.4573727, 0.4372533, 0.3599639, 0.3…
## $ Fair       <dbl> 0.19568909, 0.19568909, 0.19568909, 0.20291032, 0.08585489…
## $ Poor       <dbl> 0.032843150, 0.032843150, 0.032843150, 0.045383332, 0.0064…

Predictions: Category of response

health_m_aug <- health_m_aug %>% 
  mutate(pred_health = predict(health_m, type = "class"))

Actual vs. predicted

#rows = actual, columns = predicted
health_m_aug %>% 
  count(HealthGen, pred_health, .drop = FALSE) %>%
  pivot_wider(names_from = pred_health, values_from = n)
## # A tibble: 5 x 6
##   HealthGen Excellent Vgood  Good  Fair  Poor
##   <fct>         <int> <int> <int> <int> <int>
## 1 Excellent         0   550   223     0     0
## 2 Vgood             0  1376   785     0     0
## 3 Good              0  1255  1399     0     0
## 4 Fair              0   300   642     0     0
## 5 Poor              0    24   156     0     0

Why do you think no observations were predicted to have a rating of “Excellent”, “Fair”, or “Poor”?