The goal of this exercise is to walk through a multinomial logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.
library(tidyverse)
library(knitr)
library(broom)
library(patchwork)
library(nnet)
This data is from the NHANES dataset found in the NHANES R package. Type ?NHANES in the console to read more about the data and variables. We’ll focus on the following variables for this analysis:
HealthGen: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.
Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.
PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities
library(NHANES)
nhanes_adult <- NHANES %>%
filter(Age >= 18) %>%
select(HealthGen, Age, PhysActive) %>%
drop_na() %>%
mutate(obs_num = 1:n())
glimpse(nhanes_adult)
## Rows: 6,710
## Columns: 4
## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
p1 <- ggplot(data = nhanes_adult, aes(x = Age)) +
geom_histogram() +
labs(title = "Distribution of Age")
p2 <- ggplot(data = nhanes_adult, aes(x = PhysActive)) +
geom_bar() +
labs(title = "Moderate or vigorous sport or exercise")
p3 <- ggplot(data = nhanes_adult, aes(x = HealthGen)) +
geom_bar() +
labs(title = "Self-reported rating of overall health")
p3 + (p1 / p2)
p4 <- ggplot(data = nhanes_adult, aes(x = HealthGen, y = Age)) +
geom_boxplot(fill = "steelblue") +
labs(title = "Age vs. Health Rating")
p5 <- ggplot(data = nhanes_adult, aes(x = HealthGen, fill = PhysActive)) +
geom_bar(position = "fill") +
labs(y = "Proportion",
title = "Physical Activity vs. Health Rating") +
coord_flip()
p4 + p5
health_m <- multinom(HealthGen ~ Age + PhysActive,
data = nhanes_adult)
tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>%
kable(digits = 3, format = "markdown")
| y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|
| Vgood | (Intercept) | 1.205 | 0.145 | 8.325 | 0.000 | 0.922 | 1.489 |
| Vgood | Age | 0.001 | 0.002 | 0.369 | 0.712 | -0.004 | 0.006 |
| Vgood | PhysActiveYes | -0.321 | 0.093 | -3.454 | 0.001 | -0.503 | -0.139 |
| Good | (Intercept) | 1.948 | 0.141 | 13.844 | 0.000 | 1.672 | 2.223 |
| Good | Age | -0.002 | 0.002 | -0.977 | 0.329 | -0.007 | 0.002 |
| Good | PhysActiveYes | -1.001 | 0.090 | -11.120 | 0.000 | -1.178 | -0.825 |
| Fair | (Intercept) | 0.915 | 0.164 | 5.566 | 0.000 | 0.592 | 1.237 |
| Fair | Age | 0.003 | 0.003 | 1.058 | 0.290 | -0.003 | 0.009 |
| Fair | PhysActiveYes | -1.645 | 0.107 | -15.319 | 0.000 | -1.856 | -1.435 |
| Poor | (Intercept) | -1.521 | 0.290 | -5.238 | 0.000 | -2.090 | -0.952 |
| Poor | Age | 0.022 | 0.005 | 4.522 | 0.000 | 0.013 | 0.032 |
| Poor | PhysActiveYes | -2.656 | 0.236 | -11.275 | 0.000 | -3.117 | -2.194 |
What is the baseline category for the model?
Write the model for the odds that a person rates themselves as having “Fair” health versus the baseline category.
Interpret the coefficient of Age and its 95% confidence interval in terms of the odds that a person rates themselves as having “Poor” health versus baseline category.
Interpret the coefficient of PhysActiveYes and its 95% confidence interval in terms of the odds that a person rates themselves as having “Very Good” health versus baseline category.
#calculate predicted probabilities
pred_probs <- as_tibble(predict(health_m, type = "probs")) %>%
mutate(obs_num = 1:n())
pred_probs %>%
slice(101:105)
## # A tibble: 5 x 6
## Excellent Vgood Good Fair Poor obs_num
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.0705 0.244 0.451 0.198 0.0366 101
## 2 0.0702 0.244 0.441 0.202 0.0426 102
## 3 0.0696 0.244 0.427 0.206 0.0527 103
## 4 0.0696 0.244 0.427 0.206 0.0527 104
## 5 0.155 0.393 0.359 0.0861 0.00662 105
health_m_aug <- inner_join(nhanes_adult, pred_probs,
by = "obs_num") %>%
select(obs_num, everything())
health_m_aug %>%
glimpse()
## Rows: 6,710
## Columns: 9
## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,…
## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,…
## $ Excellent <dbl> 0.07069715, 0.07069715, 0.07069715, 0.07003173, 0.15547075…
## $ Vgood <dbl> 0.2433979, 0.2433979, 0.2433979, 0.2444214, 0.3922335, 0.3…
## $ Good <dbl> 0.4573727, 0.4573727, 0.4573727, 0.4372533, 0.3599639, 0.3…
## $ Fair <dbl> 0.19568909, 0.19568909, 0.19568909, 0.20291032, 0.08585489…
## $ Poor <dbl> 0.032843150, 0.032843150, 0.032843150, 0.045383332, 0.0064…
health_m_aug <- health_m_aug %>%
mutate(pred_health = predict(health_m, type = "class"))
#rows = actual, columns = predicted
health_m_aug %>%
count(HealthGen, pred_health, .drop = FALSE) %>%
pivot_wider(names_from = pred_health, values_from = n)
## # A tibble: 5 x 6
## HealthGen Excellent Vgood Good Fair Poor
## <fct> <int> <int> <int> <int> <int>
## 1 Excellent 0 550 223 0 0
## 2 Vgood 0 1376 785 0 0
## 3 Good 0 1255 1399 0 0
## 4 Fair 0 300 642 0 0
## 5 Poor 0 24 156 0 0
Why do you think no observations were predicted to have a rating of “Excellent”, “Fair”, or “Poor”?