--- title: "Multinomial Logistic Regression" date: "`r Sys.Date()`" output: html_document: theme: readable --- ```{r global-options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` The goal of this exercise is to walk through a multinomial logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive. ```{r load-packages} library(tidyverse) library(knitr) library(broom) library(patchwork) library(nnet) ``` This data is from the `NHANES` dataset found in the NHANES R package. Type `?NHANES` in the console to read more about the data and variables. We'll focus on the following variables for this analysis: - `HealthGen`: Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor. - `Age`: Age at time of screening (in years). Participants 80 or older were recorded as 80. - `PhysActive`: Participant does moderate to vigorous-intensity sports, fitness or recreational activities ## Load and prepare data ```{r load-data} library(NHANES) nhanes_adult <- NHANES %>% filter(Age >= 18) %>% select(HealthGen, Age, PhysActive) %>% drop_na() %>% mutate(obs_num = 1:n()) ``` ```{r} glimpse(nhanes_adult) nhanes_adult <- nhanes_adult %>% mutate(HealthGen = fct_relevel(HealthGen, "Poor", "Fair", "Good", "Vgood", "Excellent")) ``` ## Exploratory data analysis ### Univariate ```{r} p1 <- ggplot(data = nhanes_adult, aes(x = Age)) + geom_histogram() + labs(title = "Distribution of Age") p2 <- ggplot(data = nhanes_adult, aes(x = PhysActive)) + geom_bar() + labs(title = "Moderate or vigorous sport or exercise") p3 <- ggplot(data = nhanes_adult, aes(x = HealthGen)) + geom_bar() + labs(title = "Self-reported rating of overall health") p3 + (p1 / p2) ``` ### Bivariate ```{r} p4 <- ggplot(data = nhanes_adult, aes(x = HealthGen, y = Age)) + geom_boxplot(fill = "steelblue") + labs(title = "Age vs. Health Rating") p5 <- ggplot(data = nhanes_adult, aes(x = HealthGen, fill = PhysActive)) + geom_bar(position = "fill") + labs(y = "Proportion", title = "Physical Activity vs. Health Rating") + coord_flip() p4 + p5 ``` ## Fit multinomial logistic model ```{r results = "hide"} health_m <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) ``` ```{r} tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>% kable(digits = 3, format = "markdown") ``` What is the baseline category for the model? Write the model for the odds that a person rates themselves as having "Fair" health versus the baseline category. Interpret the coefficient of `Age` and its 95% confidence interval in terms of the odds that a person rates themselves as having "Poor" health versus baseline category. Interpret the coefficient of `PhysActiveYes` and its 95% confidence interval in terms of the odds that a person rates themselves as having "Very Good" health versus baseline category. ## Predictions ### Predictions: Probablities ```{r} #calculate predicted probabilities pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% mutate(obs_num = 1:n()) ``` ```{r} pred_probs %>% slice(101:105) ``` ```{r} health_m_aug <- inner_join(nhanes_adult, pred_probs, by = "obs_num") %>% select(obs_num, everything()) ``` ```{r} health_m_aug %>% glimpse() ``` ### Predictions: Category of response ```{r} health_m_aug <- health_m_aug %>% mutate(pred_health = predict(health_m, type = "class")) ``` ### Actual vs. predicted ```{r} #rows = actual, columns = predicted health_m_aug %>% count(HealthGen, pred_health, .drop = FALSE) %>% pivot_wider(names_from = pred_health, values_from = n) ``` Why do you think no observations were predicted to have a rating of "Excellent", "Fair", or "Poor"?