class: center, middle, inverse, title-slide # Logistic regression ## Model fit & Exploratory data analysis ### Dr. Maria Tackett ### 03.18.19 --- ### Announcements - HW 05 due Mon, March 25 at 11:59p - Project proposal due Fri, March 29 at 11:59p --- ### Packages ```r library(tidyverse) library(knitr) library(broom) library(NHANES) library(pROC) #ROC curves ``` --- ### Review - `\(Y\)`: binary response + 1: yes + 0: no - `\(Mean(Y) = p\)` - `\(Var(Y) = p(1-p)\)` - **Odds of "yes"**: `\(\omega = \frac{p}{1-p}\)` --- ### NHANES Data - [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) is conducted by the National Center for Health Statistics (NCHS) - The goal is to *"assess the health and nutritional status of adults and children in the United States"* - This survey includes an interview and a physical examination --- ### NHANES Data - We will use the data from the <font class="vocab">`NHANES`</font> R package - Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years - The data in this package is modified for educational purposes and should **not** be used for research - Original data can be obtained from the [NCHS website](https://www.cdc.gov/nchs/data_access/index.htm) for research purposes - Type <font class="vocab">`?NHANES`</font> in console to see list of variables and definitions --- ### NHANES: Physical Activity & Sleep - Does physical activity reduce the odds of having sleep trouble? - We will analyze the following variables: + <font class="vocab">`SleepTrouble`: </font>Participant has told doctor or other health professional they had trouble sleeping + <font class="vocab">`PhysActive`: </font>Participant does moderate to vigorous-intensity sports, fitness or recreational activities + <font class="vocab">`Age`: </font>Age at time of screening (in years). Participants 80 or older were recorded as 80. --- ### NHANES: Physical Activity & Sleep .small[ ```r set.seed(1234) nhanes <- NHANES %>% filter(Age > 18) %>% sample_n(500) nhanes %>% select(SleepTrouble, PhysActive, Age) %>% slice(1:5) ``` ``` ## # A tibble: 5 x 3 ## SleepTrouble PhysActive Age ## <fct> <fct> <int> ## 1 No Yes 33 ## 2 Yes No 65 ## 3 No Yes 52 ## 4 No Yes 34 ## 5 No Yes 58 ``` ] --- ### Logistic Regression Model - Suppose `\(P(Y_i=1|X_i) = p_i\)` and `\(P(Y_i=0|X_i) = 1-p_i\)` - The <font class="vocab3">logistic regression model </font> is `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` <br> - `\(\log\Big(\frac{p_i}{1-p_i}\Big)\)` is called the <font class="vocab3">logit</font> function --- ### Logistic Regression Model `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` <br> <br> - We can calculate `\(p_i\)` by solving the logit equation: `$$p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1 + \exp\{\beta_0 + \beta_1 X_i\}}$$` --- ### Calculating predicted probability, `\(\hat{p}\)` `$$\begin{aligned}&\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i\\ \Rightarrow \hspace{8mm} &\exp\bigg\{\log\Big(\frac{p_i}{1-p_i}\Big)\bigg\} = \exp\{\beta_0 + \beta_1 X_i\}\\ \Rightarrow \hspace{8mm} & \frac{p_i}{1-p_i} = \exp\{\beta_0 + \beta_1 X_i\} \\ \Rightarrow \hspace{8mm}&p_i = \frac{\exp\{\beta_0 + \beta_1 X_i\}}{1+\exp\{\beta_0 + \beta_1 X_i\}}\\\end{aligned}$$` --- ### Interpreting the intercept: `\(\beta_0\)` .alert[ `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` ] -- - When `\(X=0\)`, log-odds of `\(Y\)` are `\(\beta_0\)` - Won't use this interpretation in practice - **When `\(X=0\)`, odds of `\(Y\)` are `\(\exp\{\beta_0\}\)`** --- ### Interpreting slope coefficient `\(\beta_1\)` .alert[ `$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 X_i$$` ] If `\(X\)` is a <u>quantitative</u> predictor - As `\(X_i\)` increases by 1 unit, log-odds of `\(Y\)` increases by `\(\beta_1\)` - **As `\(X_i\)` increases by 1 unit, the odds of `\(Y\)` multiply by a factor of `\(\exp\{\beta_1\}\)`** -- If `\(X\)` is a <u>categorical</u> predictor - The difference in the log-odds between group `\(X\)` and the baseline is `\(\beta_1\)` - **The odds of `\(Y\)` for group `\(X\)` are expected to be `\(\exp\{\beta_1\}\)` times the odds of `\(Y\)` for the baseline group.** --- ### Estimating the Coefficients - Estimate coefficients using *maximum likelihood estimation* + covered in STA 250 and STA 360 -- - <font class="vocab">Basic Idea: </font> + Find values of `\(\beta_0\)` and `\(\beta_1\)` that make observed values of `\(Y\)` the most likely to have occurred + Use multivariable calculus and numerical methods to estimate coefficients -- - We will use R to estimate the coefficients --- ### Inference for Coefficients - <font class="vocab3">standard error, `\(SE(\hat{\beta}_1)\)`: </font>estimated standard deviation of the sampling distribution of `\(\hat{\beta}_1\)` - The center of the sampling distribution is `\(\beta_1\)`, the population coefficient - We can calculate the `\(\color{blue}{100(1-\alpha)\%}\)` <font color="blue">confidence interval</font> based on the large-sample Normal approximations + **CI for `\(\boldsymbol{\beta}_1\)`**: `$$\hat{\beta}_1 \pm z^* SE(\beta_1)$$` + **CI for `\(\exp\{\boldsymbol{\beta}_1\}\)`**: `$$\exp\{\hat{\beta}_1 \pm z^* SE(\beta_1)\}$$` --- ### Logistic Regression in R - Use the <font class="vocab">`glm`</font> function in R ```r my.model <- glm(Y ~ X, data = my.data, family = binomial) ``` - Display linear model with log-odds as the response ```r tidy(my.model, exponentiate = FALSE) ``` --- ### Age, Physical Activity, & Sleep .alert[ What is the relationship between age, physical activity, and the odds of having sleep trouble in adults? ] ```r # Use mean-centered age nhanes <- nhanes %>% mutate(ageCent = Age - mean(Age)) ``` ```r sleep_model <- glm(SleepTrouble ~ ageCent + PhysActive, data = nhanes, family = binomial) kable(tidy(sleep_model, exponentiate = FALSE, conf.int = TRUE), format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | -1.000| 0.149| -6.723| 0.000| -1.299| -0.714| |ageCent | 0.011| 0.006| 1.735| 0.083| -0.001| 0.023| |PhysActiveYes | -0.209| 0.210| -0.996| 0.319| -0.622| 0.203| --- ### Age, Physical activity, & Sleep trouble |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | -1.000| 0.149| -6.723| 0.000| -1.299| -0.714| |ageCent | 0.011| 0.006| 1.735| 0.083| -0.001| 0.023| |PhysActiveYes | -0.209| 0.210| -0.996| 0.319| -0.622| 0.203| .question[ 1. Interpret the intercept in terms of the **odds** of sleep trouble. 2. Interpret the coefficient of `PhysActive` in terms of the **odds** of sleep trouble. 3. Interpret the coefficient of `ageCent` in terms of the **odds** of sleep trouble. ] --- class: middle, center ## Model Fit --- ## Checking Model Fit **Residuals do not work well to assess model fit** - Residuals always positive when `\(Y=1\)` and negative when `\(Y=0\)` - Constant variance not an assumption of logistic regression - Normality of residuals not an assumption of logistic regression **To check function of predictors is well specified, i.e. that the model fits the data well, use...** - Confusion matrix (see Mar 6 notes) and ROC curve - Binned residuals --- ### Confusion Matrix - We can use the estimated probabilities to predict outcomes - *Ex.*: Establish a threshold such that `\(Y=1\)` if predicted probability is greater than the threshold (Y=0 otherwise) - Determine how many observations were classified correctly and incorrectly and put the results in a `\(2 \times 2\)` table + This table is the <font class="vocab3">confusion matrix</font> - If the proportion of misclassifications is high, then we conclude the model may not fit the data well --- ### Sensitivity & Specificity - <font class="vocab3">Sensitivity: </font>Proportion of observations with `\(Y=1\)` that have predicted probability above a specified threshold + Called true positive rate - <font class="vocab3">Specificity: </font>Proportion of observations with `\(Y=0\)` that have predicted probability below a specified threshold + (1 - specificity) called false positive rate - What we want: + High sensitivity + Low values of 1-specificity --- ### ROC Curve - **Receiver Operating Characteristic (ROC) curve**: + *X-axis*: `\(1 - \text{ specificity}\)` + *Y-axis*: `\(\text{ Sensitivity}\)` - Evaluated with a lot of different values for the threshold - Logistic model fits well if the area under the curve (AUC) is close to 1 - ROC in R - Use the <font class="vocab">`roc`</font> function in the `pROC` to calculate AUC - Use <font class="vocab">`geom_roc`</font> layer in ggplot to plot the ROC curve --- ### ROC Curve .small[ ```r sleep_aug <- augment(sleep_model, type.predict = "response", type.residuals = "response") ``` ```r library(plotROC) ggplot(sleep_aug, aes(d=SleepTrouble, m = .fitted)) + geom_roc(n.cuts = 0) + geom_abline(intercept = 0) ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- ### ROC: Area under the curve ```r library(pROC) roc(sleep_aug$SleepTrouble,sleep_aug$.fitted)$auc ``` ``` ## Area under the curve: 0.5684 ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ### Not useful: Raw residuals vs. predicted - Include **`type.residuals = "response"`** in the `augment` function to get the raw residuals. `$$e_i = Y_i - \hat{p}_i$$` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ### Binned Residuals - It is not useful to plot the raw residuals, so we will examine binned residual plots **When examining binned residuals** - Look for patterns - Nonlinear trend may be indication that squared term or log transformation of predictor variable required - If bins have average residuals with large magnitude + Look at averages of other predictor variables across bins + Interaction may be required if large magnitude residuals correspond to certain combinations of predictor variables --- ### Binned plot vs. predicted values - Use the <font class="vocab">`binnedplot`</font> function in the **arm** package. - *Tip: Don't load the **arm** package to avoid conflicts with tidyverse* .small[ ```r arm::binnedplot(x=sleep_aug$.fitted,y=sleep_aug$.resid,xlab="Predicted Probabilities", main = "Binned Residual vs. Predicted Values") ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] --- ### Making binned residual plot - Calculate raw residuals - Order observations either by the values of the predicted probabilities (or by numeric predictor variable) - Use the ordered data to create *g* bins of approximately equal size. Default value: `\(g = \sqrt{n}\)` - Calculate average residual value in each bin - Plot average residuals vs. average predicted probability (or average predictor value) --- ### Residuals vs. numerical predictors Make binned plot with predictor on `\(x\)` axis ```r arm::binnedplot(x=sleep_aug$ageCent,y=sleep_aug$.resid,xlab="AgeCent", main = "Binned Residual vs. Age") ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ### Residuals vs. categorical predictors - Calculate average residual for each level of the predictor - Are all means close to 0? If not, there is a problem with model fit. ```r sleep_aug %>% group_by(PhysActive) %>% summarise(mean_resid = mean(.resid)) ``` ``` ## # A tibble: 2 x 2 ## PhysActive mean_resid ## <fct> <dbl> ## 1 No -2.38e-13 ## 2 Yes -1.48e-13 ``` --- ### Residuals Let's look at the binned residuals versus `AgeCent` separately for those who are physically active and those who aren't. <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /><img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-16-2.png" style="display: block; margin: auto;" /> --- ### Model with interaction term |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:---------------------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | -1.091| 0.159| -6.845| 0.000| -1.414| -0.788| |ageCent | 0.028| 0.009| 3.154| 0.002| 0.011| 0.046| |PhysActiveYes | -0.165| 0.220| -0.752| 0.452| -0.596| 0.267| |ageCent:PhysActiveYes | -0.037| 0.013| -2.832| 0.005| -0.063| -0.012| .question[ 1. What is the effect of age on the odds of having sleep trouble for adults who are not physically active? 2. What is the effect of age on the odds of having sleep trouble for adults who are physically active? 3. Is the effect of age on having sleep trouble significantly different for the two groups? ] --- ### Binned residuals <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- ### Binned Residuals <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ``` ## # A tibble: 2 x 2 ## PhysActive mean_resid ## <fct> <dbl> ## 1 No -3.65e-15 ## 2 Yes -3.19e-13 ``` --- ### Binned residuals <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /><img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-21-2.png" style="display: block; margin: auto;" /> --- class: middle, center ## Exploratory Data Analysis --- ### Exploratory Data Analysis **Categorical predictors**: - Examine the percentage of `\(Y=1\)` for each level (category) - You can visualize using a stacked bar chart **Quantitative predictors**: - Order the predictor variable and divide into `\(g\)` groups of equal size - Calculate percentage `\(Y=1\)` in each group - See if there's a pattern (e.g. is there a quadratic pattern?) - Use `binnedplot` function --- ### EDA: Sleep trouble vs. Age .small[ ```r # create 1/0 indicator for response variable to use binnedplot nhanes <- nhanes %>% mutate(sleep_tr_ind = if_else(SleepTrouble == "Yes",1,0)) arm::binnedplot(x = nhanes$Age, y = nhanes$sleep_tr_ind, xlab = "Age", ylab = "proportion", main = "Sleep Trouble vs. Age", col.int = "white" #remove intervals ) ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] --- ### Sleep trouble vs. Physical activity ```r ggplot(data = nhanes, aes(x = PhysActive, fill = SleepTrouble)) + geom_bar(position = "fill") + labs(y = "proportion", title = "Sleep trouble vs. Physical activity") ``` <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- ### Sleep trouble vs. Age & Physical activity <img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /><img src="15-logistic-assumptions_files/figure-html/unnamed-chunk-24-2.png" style="display: block; margin: auto;" />