class: center, middle, inverse, title-slide # Logistic regression ### Dr. Maria Tackett ### 03.06.19 --- ### Announcements - Extra credit due Monday, March 18 --- ### Packages ```r library(tidyverse) library(knitr) library(broom) library(fivethirtyeight) library(pROC) #ROC curves library(questionr) #odds ratio function ``` --- ### Review - `\(Y\)`: binary response + 1: yes + 0: no - `\(Mean(Y) = p\)` - `\(Var(Y) = p(1-p)\)` - **Odds of "yes"**: `\(\omega = \frac{p}{1-p}\)` --- ### Comparing Odds Suppose we have two independent groups with odds `\(\omega_1\)` and `\(\omega_2\)` - <font class="vocab3">Odds Ratio:</font> `\(\phi = \frac{\omega_1}{\omega_2}\)` - Use inference to assess if groups have equal odds, i.e. `\(\phi = 1\)` + <font class="vocab3">Hypothesis Test:</font> `$$H_0: \log(\phi) = 0$$` + <font class="vocab3">Confidence Interval:</font> `$$\exp\Big\{\log(\phi) \pm z^* SE(\log(\phi))\Big\}$$` --- ### Is it rude to recline your seat on a plane? ```r flying <- fivethirtyeight::flying %>% drop_na(recline_rude, height, age) %>% mutate(rude = if_else(recline_rude %in% c("Somewhat", "Very"), 1, 0), rude = factor(rude), age = factor(age, order = FALSE)) # to display in model correctly ``` - **`height`**: self-reported height in feet and inches - **`age`**: 18-29, 30-44, 45-60, > 60 - **`rude`**: 1: yes, 0: no (Is it rude to recline your seat on a plane?) <br><br> Source: [*41 Percent of Fliers Think You're Rude If You Recline Your Seat*](https://fivethirtyeight.com/features/airplane-etiquette-recline-seat/) --- ### Opinions about flying |age | 0| 1| |:-----|---:|--:| |18-29 | 78| 94| |30-44 | 143| 79| **Is there a significant difference in the proportion of 18-29 year olds versus 30-44 year olds who think reclining a seat on a plane is rude?** --- ##`odds.ratio` function - We will use the <font class="vocab">`odds.ratio`</font> function in the **questionr** package to compute odds ratios and the corresponding confidence interval ```r #calculate odds ratio and 95% confidence interval flying %>% filter(age %in% c("18-29", "30-44")) %>% glm(rude ~ age, data = ., family = binomial) %>% odds.ratio(level=0.95) %>% kable(format="markdown", digits = 3) ``` | | OR| 2.5 %| 97.5 %| p| |:-----------|-----:|-----:|------:|-----:| |(Intercept) | 1.205| 0.893| 1.630| 0.223| |age30-44 | 0.458| 0.304| 0.687| 0.000| **We are 95% confident that the interval 0.304 to 0.687 contains the true odds ratio of 30-44 year olds versus 18-29 year olds who think reclining a seat.** --- class: middle, center <font class="vocab">Looking at the odds ratio is useful; however, we want to build a model to incorporate more variables that could potentially explain the odds of a flier having the opinion that reclining a seat is rude.</font> --- ### Linear model? - We want to use a model to predict a binary response `\(Y\)` -- - Suppose we use a linear regression model to predict `\(Y\)` using some explanatory variable `\(X\)` `$$Y_i = \beta_0 + \beta_1X_{i} + \epsilon_i \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)$$` -- - This model assumes that `\(Y\)` could be any continuous value; however, it can only be 0 or 1 -- - So linear regression is **not** appropriate --- ### Other model choices Let `\(P(Y_i=1|X_i) = p_i\)` and `\(P(Y_i=0|X_i) = 1-p_i\)` <br> <br> -- Potential models for `\(p_i\)`: <br> -- - **<font class="vocab">Linear:</font>** `\(p_i = \beta_0 + \beta_1 X_i\)` + could predict that `\(p_i\)` is outside of `\((0,1)\)` -- - **<font class="vocab">Log-linear:</font>** `\(\log(p_i) =\beta_0 + \beta_1 X_i\)` + could predict that `\(p_i\)` is greater than 1 --- ### Logistic Regression Model - Suppose `\(P(Y_i=1|X_i) = p_i\)` and `\(P(Y_i=0|X_i) = 1-p_i\)` - The <font class="vocab3">logistic regression model </font> is `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` <br> - `\(\log\Big(\frac{p_i}{1-p_i}\Big)\)` is called the <font class="vocab3">logit</font> function --- ### Logistic Regression Model .alert[ `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` ] <br> - We can calculate `\(p_i\)` by solving the logit equation: `$$p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}}$$` --- ### Solving Logit Equation `$$\begin{aligned}&\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i\\[15pt] \Rightarrow \hspace{8mm} &\exp\bigg\{\log\Big(\frac{p_i}{1-p_i}\Big)\bigg\} = \exp\{\beta_0 + \beta_1 X_i\}\\[15pt] \Rightarrow \hspace{8mm} & \frac{p_i}{1-p_i} = \exp\{\beta_0 + \beta_1 X_i\} \\[15pt] \Rightarrow \hspace{8mm}&p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1+\exp\{\beta_0 + \beta_1 X_i\}}\\\end{aligned}$$` --- ### Interpreting the intercept: `\(\beta_0\)` .alert[ `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` ] -- - When `\(X=0\)`, log-odds of `\(Y\)` are `\(\beta_0\)` - Won't use this interpretation in practice - **When `\(X=0\)`, odds of `\(Y\)` are `\(\exp\{\beta_0\}\)`** --- ### Interpreting slope coefficient `\(\beta_1\)` .alert[ `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` ] If `\(X\)` is a <u>quantitative</u> predictor - As `\(X_i\)` increases by 1 unit, log-odds of `\(Y\)` increases by `\(\beta_1\)` - **As `\(X_i\)` increases by 1 unit, the odds of `\(Y\)` multiply by a factor of `\(\exp\{\beta_1\}\)`** -- If `\(X\)` is a <u>categorical</u> predictor - The difference in the log-odds between group `\(X\)` and the baseline is `\(\beta_1\)` - **The odds of `\(Y\)` for group `\(X\)` are expected to be `\(\exp\{\beta_1\}\)` times the odds of `\(Y\)` for the baseline group.** --- ### Inference for coefficients - The standard error is the estimated standard deviation of the sampling distribution of `\(\hat{\beta}_1\)` - We can calculate the `\(\color{blue}{C%}\)` <font color="blue">confidence interval</font> based on the large-sample Normal approximations - **CI for `\(\boldsymbol{\beta}_1\)`**: `$$\hat{\beta}_1 \pm z^* SE(\hat{\beta}_1)$$` .alert[ **CI for `\(\exp\{\boldsymbol{\beta}_1\}\)`**: `$$\exp\{\hat{\beta}_1 \pm z^* SE(\hat{\beta}_1)\}$$` ] --- ### Estimating the coefficients - Estimate coefficients using **maximum likelihood estimation** + covered in STA 250 and STA 360 <br> -- - <font class="vocab">Basic Idea: </font> + Find values of `\(\beta_0\)` and `\(\beta_1\)` that make observed values of `\(Y\)` the most likely to have occurred + Use multivariable calculus and numerical methods to estimate coefficients -- - In this class, we will use R to estimate the coefficients --- ## Logistic regression in R - Fit a logistic model using the Use the <font class="vocab">`glm()`</font> function - Set <font class="vocab">`family=binomial`</font> for a binary response variable ```r my.model <- glm(Y ~ X1 + ... + XP, data = my.data, family = binomial) ``` - Display model with log-odds as the response ```r tidy(my.model, exponentiate = FALSE) ``` - Display model with odds as response ```r tidy(my.model, exponentiate = TRUE) ``` --- ### Opinions about flying We want to use height to predict whether a flier will think reclining a seat on an airplane is rude. To do so, we will recode height so it's quantitative. ```r flying <- flying %>% separate(height, c("feet", "inches"), remove = FALSE) %>% mutate(height_in = case_when( height == "Under 5 ft." ~ 60, TRUE ~ as.numeric(feet)*12 + as.numeric(inches))) ``` ```r flying %>% select(height, height_in) %>% slice(1:5) ``` ``` ## # A tibble: 5 x 2 ## height height_in ## <ord> <dbl> ## 1 "6'3\"" 75 ## 2 "5'8\"" 68 ## 3 "5'11\"" 71 ## 4 "5'7\"" 67 ## 5 "5'9\"" 69 ``` --- ## Reclining vs. height Use the mean-centered `height` in the model, so the intercept will have a meaningful interpretation ```r flying <- flying %>% mutate(heightCent = height_in - mean(height_in)) ``` -- ```r ht_model <- glm(rude ~ heightCent, data = flying, family = binomial) kable(tidy(ht_model, exponentiate = FALSE, conf.int = TRUE), format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-----------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | -0.348| 0.070| -4.970| 0.000| -0.485| -0.211| |heightCent | 0.012| 0.018| 0.703| 0.482| -0.022| 0.047| --- ## Reclining vs. height |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-----------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | -0.348| 0.070| -4.970| 0.000| -0.485| -0.211| |heightCent | 0.012| 0.018| 0.703| 0.482| -0.022| 0.047| - For each additional inch taller a flier is, the odds they think reclining the seat on a plane is rude are expected to multiply by a factor of 1.013), with 95% confidence interval 0.978 to 1.049. - The odds a flier of average height thinks reclining the seat on a plane is rude are 0.706 to 1, with 95% confidence interval 0.615 to 0.81. -- .question[ Is `height` a significant predictor of whether a flier thinks reclining the seat is rude? ] --- ## Reclining vs. height & age ```r ht_age_model <- glm(rude ~ heightCent + age, data = flying, family = binomial) kable(tidy(ht_age_model, exponentiate = FALSE, conf.int = TRUE), format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-----------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | 0.188| 0.153| 1.226| 0.220| -0.112| 0.490| |heightCent | 0.013| 0.018| 0.741| 0.459| -0.022| 0.049| |age30-44 | -0.782| 0.208| -3.766| 0.000| -1.192| -0.377| |age45-60 | -0.590| 0.203| -2.901| 0.004| -0.990| -0.193| |age> 60 | -0.669| 0.208| -3.216| 0.001| -1.078| -0.263| .question[ 1. Interpret the coefficient of `age30-44` in the context of the data. 2. Describe the relationship between a flier's age and the odds they think reclining the seat on a plane is rude. ] --- class: middle, center ## Predictions & Model Fit --- ## Predictions - We are often interested in predicting if a given observation will have a "yes" response - To do so, we will use the logistic regression model to predict the probability of a "yes" response for the given observation. If we have one predictor variable, then... `$$p_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$$` - We will use the predicted probabilities to classify the observation as having a "yes" or "no" response --- ### Will the passenger think I'm rude? - Suppose you want to recline your seat on an airplane, but you first want to determine if the passenger behind you will think you're rude. The passenger is about 6ft tall and around 35-40 years old. -- - Predicted log-odds that this passenger thinks reclining the seat is rude: `$$\log\bigg(\frac{\hat{p}_i}{1-\hat{p}_i}\bigg) = 0.188 + 0.013\times(72 - 67.44) - 0.782 = -0.534$$` -- - The probability this passenger thinks reclining the seat is rude: `$$\hat{p}_i = \frac{\exp\{ -0.534\}}{1 + \exp\{-0.534\}} = 0.3696$$` --- ### Predictions in R ```r x0 <- data_frame(heightCent = (72 - 67.44), age = "30-44") ``` - **Predicted log-odds** ```r predict(ht_age_model, x0) ``` ``` ## 1 ## -0.5337904 ``` - **Predicted probabilities** ```r predict(ht_age_model, x0, type = "response") ``` ``` ## 1 ## 0.3696333 ``` --- ### Will the passenger think I'm rude? ```r predict(ht_age_model, x0, type = "response") ``` ``` ## 1 ## 0.3696333 ``` The probability the passenger will think you're rude is 0.3696. .question[ Based on this probability, do you expect the passenger to think you're rude? Why or not why not? ] --- ### Confusion Matrix - We can use the estimated probabilities to predict outcomes - *Ex.*: Establish a threshold such that `\(Y=1\)` if predicted probability is greater than the threshold (Y=0 otherwise) - Determine how many observations were classified correctly and incorrectly and put the results in a `\(2 \times 2\)` table + This table is the <font class="vocab3">confusion matrix</font> - If the proportion of misclassifications is high, then we conclude the model may not fit the data well --- ### Confusion Matrix Suppose we use 0.5 as the threshold to classify responses ```r threshold <- 0.5 ht_age_aug <- augment(ht_age_model, type.predict = "response") ``` ```r ht_age_aug %>% mutate(rude_predict = if_else(.fitted > threshold, "Yes", "No")) %>% group_by(rude, rude_predict) %>% summarise(n = n()) %>% spread(rude, n) %>% kable(format="markdown") ``` |rude_predict | 0| 1| |:------------|---:|---:| |No | 416| 255| |Yes | 78| 94| --- ## Confusion matrix |rude_predict | 0| 1| |:------------|---:|---:| |No | 416| 255| |Yes | 78| 94| <br><br> .question[ What proportion of observations were misclassified?] --- ### Sensitivity & Specificity - <font class="vocab3">Sensitivity: </font>Proportion of observations with `\(Y=1\)` that have predicted probability above a specified threshold + Called true positive rate - <font class="vocab3">Specificity: </font>Proportion of observations with `\(Y=0\)` that have predicted probability below a specified threshold + (1 - specificity) called false positive rate - What we want: + High sensitivity + Low values of 1-specificity --- class: regular ### ROC Curve - <font class="vocab3">Receive Operating Characteristic (ROC) curve </font>: + *X-axis*: `\(1 - \text{ specificity}\)` + *Y-axis*: `\(\text{ Sensitivity}\)` - Evaluated with a lot of different values for the threshold - Logistic model fits well if the area under the curve (AUC) is close to 1 - ROC in R - Use the <font class="vocab">`roc`</font> function in the `pROC` to calculate AUC - Use <font class="vocab">`geom_roc`</font> layer in ggplot to plot the ROC curve --- ### Visualize ROC curve ```r library(plotROC) #extension of ggplot2 ggplot(ht_age_aug, aes(d = as.numeric(rude), m = .fitted)) + geom_roc(n.cuts = 0) + geom_abline(intercept = 0) ``` <img src="14-logistic_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- ## Area under curve ```r library(pROC) roc(ht_age_aug$rude,ht_age_aug$.fitted)$auc ``` ``` ## Area under the curve: 0.5719 ``` <img src="14-logistic_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />