class: center, middle, inverse, title-slide # Multinomial Logistic Regression ## Assumptions & Model Selection ### Prof. Maria Tackett ### 04.08.20 --- class: middle, center ### [Click for PDF of slides](17-multinom-logistic-pt2.pdf) --- class: middle, center ### Checking assumptions --- ### Assumptions for multinomial logistic regression We want to check the following assumptions for the multinomial logistic regression model: - .vocab[Linearity]: Is there a linear relationship between the log-odds and the predictor variables? <br> - .vocab[Randomness]: Was the sample randomly selected? Or can we reasonably treat it as random? <br> - .vocab[Independence]: There is no obvious relationship between observations --- ### Checking linearity For each category of the response, `\(j\)`: - Analyze a plot of the binned residuals vs. predicted probabilities - Analyze a plot of the binned residuals vs. each continuous predictor variable - Look for any patterns in the residuals plots - For each categorical predictor variables, examine the average residuals for each category of the predictor --- ### Randomness Assess randomness based on a description of the data collection - Was the sample randomly selected? - If the sample was not randomly selected, is there reason to believe the observations in the sample differ systematically from the population of interest? --- ### Independence Assess independence based on a description of the data collection - Is there an obvious relationship between observations? - This assumption is most often violated when data was collected over time or there is a spatial relationship between observations? --- ### NHANES Data - **Question**: Can we use a person's age and whether they do regular physical activity to predict their self-reported health rating? - Variables: + <font class="vocab">`HealthGen`: </font>Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor. + <font class="vocab">`Age`: </font>Age at time of screening (in years). Participants 80 or older were recorded as 80. + <font class="vocab">`PhysActive`: </font>Participant does moderate to vigorous-intensity sports, fitness or recreational activities + .vocab[`Education`]: Categorical variable for highest education level attained --- ### Model |y.level |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------|:-------------|--------:|---------:|---------:|-------:|--------:|---------:| |Vgood |(Intercept) | 1.205| 0.145| 8.325| 0.000| 0.922| 1.489| |Vgood |Age | 0.001| 0.002| 0.369| 0.712| -0.004| 0.006| |Vgood |PhysActiveYes | -0.321| 0.093| -3.454| 0.001| -0.503| -0.139| |Good |(Intercept) | 1.948| 0.141| 13.844| 0.000| 1.672| 2.223| |Good |Age | -0.002| 0.002| -0.977| 0.329| -0.007| 0.002| |Good |PhysActiveYes | -1.001| 0.090| -11.120| 0.000| -1.178| -0.825| |Fair |(Intercept) | 0.915| 0.164| 5.566| 0.000| 0.592| 1.237| |Fair |Age | 0.003| 0.003| 1.058| 0.290| -0.003| 0.009| |Fair |PhysActiveYes | -1.645| 0.107| -15.319| 0.000| -1.856| -1.435| |Poor |(Intercept) | -1.521| 0.290| -5.238| 0.000| -2.090| -0.952| |Poor |Age | 0.022| 0.005| 4.522| 0.000| 0.013| 0.032| |Poor |PhysActiveYes | -2.656| 0.236| -11.275| 0.000| -3.117| -2.194| --- ### NHANES: Residuals ```r #calculate residuals residuals <- as_tibble(residuals(health_m)) %>% #calculate residuals setNames(paste('resid.', names(.), sep = "")) %>% #update column names mutate(obs_num = 1:n()) #add obs number ``` ```r residuals %>% slice(1:10) ``` ``` ## # A tibble: 10 x 6 ## resid.Excellent resid.Vgood resid.Good resid.Fair resid.Poor obs_num ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 -0.0707 -0.243 0.543 -0.196 -0.0328 1 ## 2 -0.0707 -0.243 0.543 -0.196 -0.0328 2 ## 3 -0.0707 -0.243 0.543 -0.196 -0.0328 3 ## 4 -0.0700 -0.244 0.563 -0.203 -0.0454 4 ## 5 -0.155 0.608 -0.360 -0.0859 -0.00648 5 ## 6 -0.155 0.608 -0.360 -0.0859 -0.00648 6 ## 7 -0.155 0.608 -0.360 -0.0859 -0.00648 7 ## 8 -0.156 0.600 -0.343 -0.0916 -0.0103 8 ## 9 -0.156 0.603 -0.349 -0.0894 -0.00865 9 ## 10 -0.156 -0.396 -0.353 0.912 -0.00791 10 ``` --- ### Make "augmented" dataset ```r #calculate predicted probabilities pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% mutate(obs_num = 1:n()) ``` ```r health_m_aug <- inner_join(nhanes_adult, pred_probs) #add probs health_m_aug <- inner_join(health_m_aug, residuals) #add resid ``` ```r health_m_aug %>% glimpse() ``` ``` ## Rows: 6,710 ## Columns: 15 ## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, V… ## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 5… ## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No… ## $ Education <fct> High School, High School, High School, Some College, … ## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16… ## $ Excellent <dbl> 0.07069715, 0.07069715, 0.07069715, 0.07003173, 0.155… ## $ Vgood <dbl> 0.2433979, 0.2433979, 0.2433979, 0.2444214, 0.3922335… ## $ Good <dbl> 0.4573727, 0.4573727, 0.4573727, 0.4372533, 0.3599639… ## $ Fair <dbl> 0.19568909, 0.19568909, 0.19568909, 0.20291032, 0.085… ## $ Poor <dbl> 0.032843150, 0.032843150, 0.032843150, 0.045383332, 0… ## $ resid.Excellent <dbl> -0.07069715, -0.07069715, -0.07069715, -0.07003173, -… ## $ resid.Vgood <dbl> -0.2433979, -0.2433979, -0.2433979, -0.2444214, 0.607… ## $ resid.Good <dbl> 0.5426273, 0.5426273, 0.5426273, 0.5627467, -0.359963… ## $ resid.Fair <dbl> -0.19568909, -0.19568909, -0.19568909, -0.20291032, -… ## $ resid.Poor <dbl> -0.032843150, -0.032843150, -0.032843150, -0.04538333… ``` --- ### Binned residuals vs. pred. probabilities <img src="17-multinom-logistic-pt2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /><img src="17-multinom-logistic-pt2_files/figure-html/unnamed-chunk-11-2.png" style="display: block; margin: auto;" /> --- ### Binned residuals vs. `Age` <img src="17-multinom-logistic-pt2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /><img src="17-multinom-logistic-pt2_files/figure-html/unnamed-chunk-12-2.png" style="display: block; margin: auto;" /> --- ### Residuals vs. `PhysActive` ```r health_m_aug %>% group_by(PhysActive) %>% summarise(mean.Excellent = mean(resid.Excellent), mean.Vgood = mean(resid.Vgood), mean.Good = mean(resid.Good), mean.Fair = mean(resid.Fair), mean.Poor = mean(resid.Poor)) %>% t() ``` ``` ## [,1] [,2] ## PhysActive "No" "Yes" ## mean.Excellent "-1.194022e-07" " 2.106514e-06" ## mean.Vgood " 1.644794e-06" "-1.871461e-06" ## mean.Good "-3.227820e-06" " 1.140886e-07" ## mean.Fair " 1.333924e-06" "-3.860756e-07" ## mean.Poor "3.685045e-07" "3.693412e-08" ``` --- ### Randomness & Independence - .vocab[Randomness]: - About the R dataset: *"NHANES can be treated, for educational purposes, as if it were a simple random sample from the American population."* - What about the actual NHANES data? Type `?NHANES` in the console to read more details about how participants are selected for the actual survey. - .vocab[Independence]: The participants are randomly selected within subpopulations, so the independence assumption is satisfied. --- class: middle, center ### Model Selection --- ### Comparing Nested Models - Suppose there are two models: - Reduced Model includes predictors `\(x_1, \ldots, x_q\)` - Full Model includes predictors `\(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)` - We want to test the hypotheses `$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$` - To do so, we will use the .vocab[Drop-in-Deviance test] (very similar to logistic regression) --- ### Add `Education` to the model? - We consider adding the participants' `Education` level to the model. - Education takes values `8thGrade`, `9-11thGrade`, `HighSchool`, `SomeCollege`, and `CollegeGrad` - Models we're testing: - Reduced Model: `Age`, `PhysActive` - Full Model: `Age`, `PhysActive`, `Education` .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad} = 0\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] --- ### Add `Education` to the model? .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad} = 0\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] ```r model_red <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) model_full <- multinom(HealthGen ~ Age + PhysActive + Education, data = nhanes_adult) ``` --- ### Add `Education` to the model? ```r model_red <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) model_full <- multinom(HealthGen ~ Age + PhysActive + Education, data = nhanes_adult) ``` ```r anova(model_red, model_full, test = "Chisq") %>% kable(format = "markdown") ``` |Model | Resid. df| Resid. Dev|Test | Df| LR stat.| Pr(Chi)| |:----------------------------|---------:|----------:|:------|-----:|--------:|-------:| |Age + PhysActive | 25848| 16994.23| | NA| NA| NA| |Age + PhysActive + Education | 25832| 16505.10|1 vs 2 | 16| 489.1319| 0| -- At least one coefficient associated with `Education` is non-zero. Therefore, `Education` is a statistically significant predictor for `HealthGen`. --- ### Model with `Education` |y.level |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------|:-----------------------|--------:|---------:|---------:|-------:|--------:|---------:| |Vgood |(Intercept) | 0.582| 0.301| 1.930| 0.054| -0.009| 1.173| |Vgood |Age | 0.001| 0.003| 0.419| 0.675| -0.004| 0.006| |Vgood |PhysActiveYes | -0.264| 0.099| -2.681| 0.007| -0.457| -0.071| |Vgood |Education9 - 11th Grade | 0.768| 0.308| 2.493| 0.013| 0.164| 1.372| |Vgood |EducationHigh School | 0.701| 0.280| 2.509| 0.012| 0.153| 1.249| |Vgood |EducationSome College | 0.788| 0.271| 2.901| 0.004| 0.256| 1.320| |Vgood |EducationCollege Grad | 0.408| 0.268| 1.522| 0.128| -0.117| 0.933| |Good |(Intercept) | 2.041| 0.272| 7.513| 0.000| 1.508| 2.573| |Good |Age | -0.002| 0.003| -0.651| 0.515| -0.007| 0.003| |Good |PhysActiveYes | -0.758| 0.096| -7.884| 0.000| -0.946| -0.569| |Good |Education9 - 11th Grade | 0.360| 0.275| 1.310| 0.190| -0.179| 0.899| |Good |EducationHigh School | 0.085| 0.247| 0.345| 0.730| -0.399| 0.569| |Good |EducationSome College | -0.011| 0.239| -0.047| 0.962| -0.480| 0.457| |Good |EducationCollege Grad | -0.891| 0.236| -3.767| 0.000| -1.354| -0.427| |Fair |(Intercept) | 2.116| 0.288| 7.355| 0.000| 1.552| 2.680| |Fair |Age | 0.000| 0.003| 0.107| 0.914| -0.006| 0.006| |Fair |PhysActiveYes | -1.191| 0.115| -10.367| 0.000| -1.416| -0.966| |Fair |Education9 - 11th Grade | -0.224| 0.279| -0.802| 0.422| -0.771| 0.323| |Fair |EducationHigh School | -0.832| 0.252| -3.307| 0.001| -1.326| -0.339| |Fair |EducationSome College | -1.343| 0.246| -5.462| 0.000| -1.825| -0.861| |Fair |EducationCollege Grad | -2.509| 0.253| -9.913| 0.000| -3.005| -2.013| |Poor |(Intercept) | -0.200| 0.411| -0.488| 0.626| -1.005| 0.605| |Poor |Age | 0.018| 0.005| 3.527| 0.000| 0.008| 0.028| |Poor |PhysActiveYes | -2.267| 0.242| -9.377| 0.000| -2.741| -1.793| |Poor |Education9 - 11th Grade | -0.360| 0.353| -1.020| 0.308| -1.053| 0.332| |Poor |EducationHigh School | -1.150| 0.334| -3.438| 0.001| -1.805| -0.494| |Poor |EducationSome College | -1.073| 0.316| -3.399| 0.001| -1.692| -0.454| |Poor |EducationCollege Grad | -2.322| 0.366| -6.342| 0.000| -3.039| -1.604| --- ### Compare NHANES models using AIC ```r glance(model_red) ``` ``` ## # A tibble: 1 x 3 ## edf deviance AIC ## <dbl> <dbl> <dbl> ## 1 12 16994. 17018. ``` ```r glance(model_full) ``` ``` ## # A tibble: 1 x 3 ## edf deviance AIC ## <dbl> <dbl> <dbl> ## 1 28 16505. 16561. ``` - Use the `step()` function to do model selection with AIC as the selection criteria --- class: middle, center .question[ What questions do you have about multinomial logistic regression? ]