---
title: "Multinomial Logistic Regression"
date: "`r Sys.Date()`"
output:
html_document:
theme: readable
---
```{r global-options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE,
message=FALSE)
```
The goal of this exercise is to walk through a multinomial logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.
```{r load-packages}
library(tidyverse)
library(knitr)
library(broom)
library(patchwork)
library(nnet)
```
This data is from the `NHANES` dataset found in the NHANES R package. Type `?NHANES` in the console to read more about the data and variables. We'll focus on the following variables for this analysis:
- `HealthGen`: Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor.
- `Age`: Age at time of screening (in years). Participants 80 or older were recorded as 80.
- `PhysActive`: Participant does moderate to vigorous-intensity sports, fitness or recreational activities
## Load and prepare data
```{r load-data}
library(NHANES)
nhanes_adult <- NHANES %>%
filter(Age >= 18) %>%
select(HealthGen, Age, PhysActive) %>%
drop_na() %>%
mutate(obs_num = 1:n())
```
```{r}
glimpse(nhanes_adult)
nhanes_adult <- nhanes_adult %>%
mutate(HealthGen = fct_relevel(HealthGen, "Poor", "Fair", "Good", "Vgood", "Excellent"))
```
## Exploratory data analysis
### Univariate
```{r}
p1 <- ggplot(data = nhanes_adult, aes(x = Age)) +
geom_histogram() +
labs(title = "Distribution of Age")
p2 <- ggplot(data = nhanes_adult, aes(x = PhysActive)) +
geom_bar() +
labs(title = "Moderate or vigorous sport or exercise")
p3 <- ggplot(data = nhanes_adult, aes(x = HealthGen)) +
geom_bar() +
labs(title = "Self-reported rating of overall health")
p3 + (p1 / p2)
```
### Bivariate
```{r}
p4 <- ggplot(data = nhanes_adult, aes(x = HealthGen, y = Age)) +
geom_boxplot(fill = "steelblue") +
labs(title = "Age vs. Health Rating")
p5 <- ggplot(data = nhanes_adult, aes(x = HealthGen, fill = PhysActive)) +
geom_bar(position = "fill") +
labs(y = "Proportion",
title = "Physical Activity vs. Health Rating") +
coord_flip()
p4 + p5
```
## Fit multinomial logistic model
```{r results = "hide"}
health_m <- multinom(HealthGen ~ Age + PhysActive,
data = nhanes_adult)
```
```{r}
tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>%
kable(digits = 3, format = "markdown")
```
What is the baseline category for the model?
Write the model for the odds that a person rates themselves as having "Fair" health versus the baseline category.
Interpret the coefficient of `Age` and its 95% confidence interval in terms of the odds that a person rates themselves as having "Poor" health versus baseline category.
Interpret the coefficient of `PhysActiveYes` and its 95% confidence interval in terms of the odds that a person rates themselves as having "Very Good" health versus baseline category.
## Predictions
### Predictions: Probablities
```{r}
#calculate predicted probabilities
pred_probs <- as_tibble(predict(health_m, type = "probs")) %>%
mutate(obs_num = 1:n())
```
```{r}
pred_probs %>%
slice(101:105)
```
```{r}
health_m_aug <- inner_join(nhanes_adult, pred_probs,
by = "obs_num") %>%
select(obs_num, everything())
```
```{r}
health_m_aug %>%
glimpse()
```
### Predictions: Category of response
```{r}
health_m_aug <- health_m_aug %>%
mutate(pred_health = predict(health_m, type = "class"))
```
### Actual vs. predicted
```{r}
#rows = actual, columns = predicted
health_m_aug %>%
count(HealthGen, pred_health, .drop = FALSE) %>%
pivot_wider(names_from = pred_health, values_from = n)
```
Why do you think no observations were predicted to have a rating of "Excellent", "Fair", or "Poor"?