Multinomial Logistic Regression

The goal of this exercise is to walk through a multinomial logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.

library(tidyverse)
library(knitr)
library(broom)
library(patchwork)
library(nnet)

This data is from the NHANES dataset found in the NHANES R package. Type ?NHANES in the console to read more about the data and variables. We’ll focus on the following variables for this analysis:

HealthGen: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.
Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.
PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities

Load and prepare data

library(NHANES)

nhanes_adult <- NHANES %>%
  filter(Age >= 18) %>%
  select(HealthGen, Age, PhysActive) %>%
  drop_na() %>%
  mutate(obs_num = 1:n())

glimpse(nhanes_adult)

## Rows: 6,710
## Columns: 4
## $ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…

Exploratory data analysis

Univariate

p1 <- ggplot(data = nhanes_adult, aes(x = Age)) + 
  geom_histogram() +
  labs(title = "Distribution of Age")

p2 <- ggplot(data = nhanes_adult, aes(x = PhysActive)) + 
  geom_bar() +
  labs(title = "Moderate or vigorous sport or exercise")

p3 <- ggplot(data = nhanes_adult, aes(x = HealthGen)) + 
  geom_bar() +
  labs(title = "Self-reported rating of overall health")

p3 + (p1 / p2)

Bivariate

p4 <- ggplot(data = nhanes_adult, aes(x = HealthGen, y = Age)) +
  geom_boxplot(fill = "steelblue") + 
  labs(title = "Age vs. Health Rating")

p5 <- ggplot(data = nhanes_adult, aes(x = HealthGen, fill = PhysActive)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", 
       title = "Physical Activity vs. Health Rating") +
  coord_flip()
  
p4 + p5

Fit multinomial logistic model

health_m <- multinom(HealthGen ~ Age + PhysActive, 
                     data = nhanes_adult)

tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>%
  kable(digits = 3, format = "markdown")

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
Vgood	(Intercept)	1.205	0.145	8.325	0.000	0.922	1.489
Vgood	Age	0.001	0.002	0.369	0.712	-0.004	0.006
Vgood	PhysActiveYes	-0.321	0.093	-3.454	0.001	-0.503	-0.139
Good	(Intercept)	1.948	0.141	13.844	0.000	1.672	2.223
Good	Age	-0.002	0.002	-0.977	0.329	-0.007	0.002
Good	PhysActiveYes	-1.001	0.090	-11.120	0.000	-1.178	-0.825
Fair	(Intercept)	0.915	0.164	5.566	0.000	0.592	1.237
Fair	Age	0.003	0.003	1.058	0.290	-0.003	0.009
Fair	PhysActiveYes	-1.645	0.107	-15.319	0.000	-1.856	-1.435
Poor	(Intercept)	-1.521	0.290	-5.238	0.000	-2.090	-0.952
Poor	Age	0.022	0.005	4.522	0.000	0.013	0.032
Poor	PhysActiveYes	-2.656	0.236	-11.275	0.000	-3.117	-2.194

What is the baseline category for the model?

Write the model for the odds that a person rates themselves as having “Fair” health versus the baseline category.

Interpret the coefficient of Age and its 95% confidence interval in terms of the odds that a person rates themselves as having “Poor” health versus baseline category.

Interpret the coefficient of PhysActiveYes and its 95% confidence interval in terms of the odds that a person rates themselves as having “Very Good” health versus baseline category.

Predictions

Predictions: Probablities

#calculate predicted probabilities
pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% 
                        mutate(obs_num = 1:n())

pred_probs %>%
  slice(101:105)

## # A tibble: 5 x 6
##   Excellent Vgood  Good   Fair    Poor obs_num
##       <dbl> <dbl> <dbl>  <dbl>   <dbl>   <int>
## 1    0.0705 0.244 0.451 0.198  0.0366      101
## 2    0.0702 0.244 0.441 0.202  0.0426      102
## 3    0.0696 0.244 0.427 0.206  0.0527      103
## 4    0.0696 0.244 0.427 0.206  0.0527      104
## 5    0.155  0.393 0.359 0.0861 0.00662     105

health_m_aug <- inner_join(nhanes_adult, pred_probs, 
                           by = "obs_num") %>%
  select(obs_num, everything())

health_m_aug %>% 
  glimpse()

## Rows: 6,710
## Columns: 9
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ Excellent  <dbl> 0.07069715, 0.07069715, 0.07069715, 0.07003173, 0.15547075…
## $ Vgood      <dbl> 0.2433979, 0.2433979, 0.2433979, 0.2444214, 0.3922335, 0.3…
## $ Good       <dbl> 0.4573727, 0.4573727, 0.4573727, 0.4372533, 0.3599639, 0.3…
## $ Fair       <dbl> 0.19568909, 0.19568909, 0.19568909, 0.20291032, 0.08585489…
## $ Poor       <dbl> 0.032843150, 0.032843150, 0.032843150, 0.045383332, 0.0064…

Predictions: Category of response

health_m_aug <- health_m_aug %>% 
  mutate(pred_health = predict(health_m, type = "class"))

Actual vs. predicted

#rows = actual, columns = predicted
health_m_aug %>% 
  count(HealthGen, pred_health, .drop = FALSE) %>%
  pivot_wider(names_from = pred_health, values_from = n)

## # A tibble: 5 x 6
##   HealthGen Excellent Vgood  Good  Fair  Poor
##   <fct>         <int> <int> <int> <int> <int>
## 1 Excellent         0   550   223     0     0
## 2 Vgood             0  1376   785     0     0
## 3 Good              0  1255  1399     0     0
## 4 Fair              0   300   642     0     0
## 5 Poor              0    24   156     0     0

Why do you think no observations were predicted to have a rating of “Excellent”, “Fair”, or “Poor”?