class: center, middle, inverse, title-slide

# Multinomial Logistic Regression

### Dr. Maria Tackett

### 03.27.19

---

### Announcements

- Lab 07 due **today**

- Project proposal due Fri, March 29

---

## ROC curve & thresholds

<img src="18-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

.small[

```r
library(pROC)
roc_results <- roc(model_aug$y, model_aug$.fitted)

# 2546 is the index for the threshold that has the chosen
# true positive rate = 0.6 and chosen false positive rate = 0.25
roc_results$thresholds[2546]
```

```
## [1] 0.207046
```

```r
roc_results$sensitivities[2546]
```

```
## [1] 0.6050269
```

```r
1 - roc_results$specificities[2546]
```

```
## [1] 0.2499194
```
]

---

### Confusion matrix at chosen threshold

| TenYearCHD| Predicted At Risk| Predicted No Risk|
|----------:|-----------------:|-----------------:|
|          0|               775|              2326|
|          1|               337|               220|

<br>

The true positive rate is `\(\frac{337}{337 + 220} = 0.6050\)`

<br>

The false positive rate is `\(\frac{775}{775 + 2326} = 0.2499\)`

---

class: middle, center

## Multinomial Logistic Regression

---

### Generalized Linear Models (GLM)

- In practice, there are many different types of response variables, including:
    + **Binary**: Win or Lose
    + **Nominal**: Democrat, Republican, or Third Party candidate
    + **Ordered**: Movie rating (1 - 5 stars)
    + and others...

- Models for all of these are examples of **generalized linear models**, a broader class of models that generalizes the multiple linear regression model

- See [*Generalized Linear Models: A Unifying Theory*](https://bookdown.org/roback/bookdown-bysh/ch-glms.html) for more details about GLMs

---

### Binary Response (Logistic)

- Given `\(P(Y_i=1|X_i)=p_i \hspace{5mm} \text{ and } \hspace{5mm} P(Y_i=0|X_i) = 1-p_i\)`

`$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$`

<br>

- We can calculate `\(p_i\)` by solving the logit equation:

`$$p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}}$$`

---

### Binary Response (Logistic)

- Suppose we consider `\(Y=0\)` the *<font color="blue">baseline category</font>* such that

`$$P(Y_i=0|X_i) = p_{i0} \text{ and } P(Y_i=1|X_i) = p_{i1}$$`

--

- Then the logit model is

`$$\log\bigg(\frac{p_{i1}}{p_{i0}}\bigg) = \beta_0 + \beta_1 X_i$$`

--

- <font class="vocab">Slope, `\(\beta_1\)`</font>: When `\(X\)` increases by one unit, the odds of `\(Y=1\)` versus the baseline `\(Y=0\)` are expected to multiply by a factor of `\(\exp\{\beta_1\}\)`

- <font class="vocab3">Intercept, `\(\beta_0\)`</font>: When `\(X=0\)`, the odds of `\(Y=1\)` versus the baseline `\(Y=0\)` are expected to be `\(\exp\{\beta_0\}\)`
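---

### Binary Response (Logistic) in R

These interpretations are easy to verify in R. A minimal sketch, assuming a data frame `my.data` with a 0/1 response `y` and a predictor `x` (all three names are placeholders):

```r
library(broom)

# fit the binary logistic model for the log-odds of y = 1
binary_model <- glm(y ~ x, data = my.data, family = binomial)

# exponentiate the coefficients to get the odds scale:
# exp(beta1) is the multiplicative change in the odds of y = 1
# for a one-unit increase in x
tidy(binary_model, exponentiate = TRUE)
```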
Then `$$\log\bigg(\frac{p_{i2}}{p_{i1}}\bigg) = \beta_{02} + \beta_{12} X_i \\[10pt] \log\bigg(\frac{p_{i3}}{p_{i1}}\bigg) = \beta_{03} + \beta_{13} X_i$$` - <font class="vocab">Slope, `\(\beta_{1j}\)`</font>: When `\(X\)` increases by one unit, the odds of `\(Y=j\)` versus the baseline are expected to multiply by a factor of `\(\exp\{\beta_{1j}\}\)` - <font class="vocab3">Intercept, `\(\beta_{0j}\)`</font>: When `\(X=0\)`, the odds of `\(Y=j\)` versus the baseline are expected to be `\(\exp\{\beta_{0j}\}\)` --- ### Multinomial Regression in R - Use the <font class="vocab">`multinom()`</font> function in the `nnet` package ```r library(nnet) my.model <- multinom(Y ~ X1 + X2 + ... + XP,data=my.data) tidy(my.model, exponentiate = FALSE) #display log-odds model ``` <br> ```r # calculate predicted probabilities pred.probs <- predict(my.model, type = "probs") ``` --- ### Iris Data - To illustrate multinomial logistic regression, we will use the <font class="vocab">`iris`</font> data set. - **Response:** + <font class="vocab">Species: </font>versicolor, setosa, virginica - **Predictors** (all measured in centimeters): + <font class="vocab">Sepal.Length</font> + <font class="vocab">Sepal.Width</font> + <font class="vocab">Petal.Length</font> + <font class="vocab">Petal.Width</font> --- ### Iris Data ```r glimpse(iris) ``` ``` ## Observations: 150 ## Variables: 5 ## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5… ## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3… ## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1… ## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0… ## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, set… ``` --- ### Iris Data <center> <img src="img/17/iris_parts.png" width="70%" style="display: block; margin: auto;" /> </center> --- ### Iris Data - <font class="vocab">Goal: </font> Use characteristics of the petals and sepal to distinguish between the three species <center> <img src="img/17/iris_type.jpg" width="80%" style="display: block; margin: auto;" /> </center> --- ### Iris: Model **Suppose we have already done exploratory data analysis.** <br> ```r library(nnet) model1 <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris) ``` --- ### Iris: Model ```r kable(tidy(model1, exponentiate = FALSE), format = "markdown", digits = 3) ``` |y.level |term | estimate| std.error| statistic| p.value| |:----------|:------------|--------:|---------:|---------:|-------:| |versicolor |(Intercept) | 18.690| 34.971| 0.534| 0.593| |versicolor |Sepal.Length | -5.458| 89.892| -0.061| 0.952| |versicolor |Sepal.Width | -8.707| 157.042| -0.055| 0.956| |versicolor |Petal.Length | 14.245| 60.192| 0.237| 0.813| |versicolor |Petal.Width | -3.098| 45.489| -0.068| 0.946| |virginica |(Intercept) | -23.836| 35.766| -0.666| 0.505| |virginica |Sepal.Length | -7.924| 89.912| -0.088| 0.930| |virginica |Sepal.Width | -15.371| 157.120| -0.098| 0.922| |virginica |Petal.Length | 23.660| 60.468| 0.391| 0.696| |virginica |Petal.Width | 15.135| 45.934| 0.330| 0.742| --- ### Iris: Interpreting Coefficients .question[ 1. What is the baseline category? 2. Write the model for the log odds. 3. Interpret the coefficient for `Sepal.Length` in terms of the odds that an iris is a `versicolor` versus the baseline species. 
---

## Calculating Probabilities

- For `\(j=2,\ldots,k\)`, we calculate the probabilities `\(p_{ij}\)` as

`$$p_{ij} = \frac{\exp\{\beta_{0j} + \beta_{1j}X_i\}}{1 + \sum\limits_{l=2}^k \exp\{\beta_{0l} + \beta_{1l}X_i\}}$$`

<br>

<br>

- For the baseline category, `\(j=1\)`, we calculate the probability `\(p_{i1}\)` as

`$$p_{i1} = 1- \sum\limits_{j=2}^k p_{ij}$$`

---

### Model Diagnostics

For each category of the response, `\(j\)`:

- Analyze a plot of the binned residuals vs. the predicted probabilities

- Analyze a plot of the binned residuals vs. each continuous predictor variable

- Look for any patterns in the residual plots

- For each categorical predictor variable, examine the average residuals for each category of the response variable

---

### Drop-in-deviance Test

- Suppose there are two models:
    - Model 1 includes predictors `\(x_1, \ldots, x_q\)`
    - Model 2 includes predictors `\(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)`

- We want to test the hypotheses

`$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least 1 }\beta_j \text{ is not } 0\end{aligned}$$`

- Use the **drop-in-deviance test** to compare the models (similar to logistic regression)

---

### Iris: Predicted probabilities & residuals

```r
# calculate predicted probabilities
pred_probs <- data.frame(predict(model1, type = "probs"))
```

--

```r
iris <- iris %>%
  mutate(virginica = ifelse(Species == "virginica", 1, 0),
         versicolor = ifelse(Species == "versicolor", 1, 0),
         setosa = ifelse(Species == "setosa", 1, 0),
         resid_virginica = virginica - pred_probs$virginica,
         resid_versicolor = versicolor - pred_probs$versicolor,
         resid_setosa = setosa - pred_probs$setosa)
```

---

### Binned Residuals vs. Predictors

- Below are the plots for *Sepal.Length*. These plots should be repeated for each explanatory variable.

.small[
<img src="18-multinomial-logistic-pt2_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

---

### Interactions with Sepal.Length?

We want to test whether there are any significant interactions with `Sepal.Length`

.small[

```r
# test for Sepal.Length interactions
model1 <- multinom(Species ~ Sepal.Length + Sepal.Width + 
                     Petal.Length + Petal.Width, data = iris)
model2 <- multinom(Species ~ Sepal.Length + Sepal.Width + 
                     Petal.Length + Petal.Width + 
                     Sepal.Length * Sepal.Width + 
                     Sepal.Length * Petal.Length + 
                     Sepal.Length * Petal.Width, data = iris)
```
]

---

### Interactions with Sepal.Length?

```r
anova(model1, model2, test = "Chisq")
```

```
## Likelihood ratio tests of Multinomial Models
## 
## Response: Species
##                                                                                                                                            Model
## 1                                                                                       Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## 2 Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Sepal.Length * Sepal.Width + Sepal.Length * Petal.Length + Sepal.Length * Petal.Width
##   Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
## 1       290  11.899734                                
## 2       284   9.279243 1 vs 2     6 2.620492 0.8547465
```

<br>

--

**No significant interactions with Sepal.Length**

---

### Actual vs. Predicted Species

- We can use our model to predict the species

- The predicted species is the one with the highest predicted probability

```r
pred_species <- predict(model1, type = "class")
table(iris$Species, pred_species)
```

```
##             pred_species
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          1        49
```
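---

### Misclassification Rate

The table on the previous slide can be summarized by the overall classification accuracy. A quick sketch, reusing `pred_species`:

```r
# proportion of irises whose species is correctly predicted
mean(pred_species == iris$Species)
```

From the table, 148 of the 150 irises (about 98.7%) are classified correctly.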
---

### Example: *Sesame Street*

- We will analyze data from an experiment to test the effectiveness of the children's program *Sesame Street*.

- As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.

- We want to understand what effect the encouragement had on the frequency of viewing after adjusting for other characteristics.

---

### Response Variable

- <font class="vocab">`viewcat`</font>
    + 1: rarely watched the show
    + 2: watched once or twice a week
    + 3: watched three to five times a week
    + 4: watched the show on average more than five times a week

---

### Predictor Variables

- <font class="vocab">`age`:</font> child's age in months

- <font class="vocab">`prenumb`:</font> score on numbers pretest (0 to 54)

- <font class="vocab">`prelet`:</font> score on letters pretest (0 to 58)

- <font class="vocab">`viewenc`:</font> 1: encouraged to watch, 2: not encouraged

- <font class="vocab">`site`:</font>
    + 1: three to five year old from disadvantaged inner city area
    + 2: four year old from advantaged suburban area
    + 3: from advantaged rural area
    + 4: from disadvantaged rural area
    + 5: from Spanish speaking home

.footnote[[Full data description](http://www2.stat.duke.edu/~jerry/sta210/sesamelab.html)]

---

class: middle, center

[*Sesame Street* analysis](https://www2.stat.duke.edu/courses/Spring19/sta210.001/appex/18-multinom-logistic.html)
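---

### *Sesame Street*: Model Sketch

As a possible starting point for the analysis linked above, here is a sketch of the model fit. The data frame name `sesame` is an assumption, and the coded variables are converted to factors so they are treated as categorical:

```r
library(nnet)

# `sesame` is a placeholder name for the data frame
sesame <- sesame %>%
  mutate(viewcat = factor(viewcat),   # response: viewing frequency
         viewenc = factor(viewenc),   # encouraged vs. not encouraged
         site = factor(site))         # type of home/community

sesame_model <- multinom(viewcat ~ age + prenumb + prelet + 
                           viewenc + site, data = sesame)
```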