class: center, middle, inverse, title-slide # Logistic regression ## Exploratory data analysis & Drop-in-Deviance Test ### Dr. Maria Tackett ### 03.20.19 --- ### Announcements - HW 05 due Mon, March 25 at 11:59p - Project proposal due Fri, March 29 at 11:59p --- ### Packages ```r library(tidyverse) library(knitr) library(broom) library(NHANES) library(pROC) #ROC curves ``` --- ### NHANES Data - [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) is conducted by the National Center for Health Statistics (NCHS) - The goal is to *"assess the health and nutritional status of adults and children in the United States"* - This survey includes an interview and a physical examination --- ### NHANES: Physical Activity & Sleep - Does physical activity reduce the odds of having sleep trouble? - We will analyze the following variables: + <font class="vocab">`SleepTrouble`: </font>Participant has told doctor or other health professional they had trouble sleeping + <font class="vocab">`PhysActive`: </font>Participant does moderate to vigorous-intensity sports, fitness or recreational activities + <font class="vocab">`Age`: </font>Age at time of screening (in years). Participants 80 or older were recorded as 80. + <font class="vocab">`Education`: </font>8th Grade, 9-11th Grade, High School, Some College, College Grad --- ### NHANES: Physical Activity & Sleep .small[ ``` ## # A tibble: 5 x 4 ## SleepTrouble PhysActive Age Education ## <fct> <fct> <int> <fct> ## 1 No No 30 9 - 11th Grade ## 2 No Yes 20 Some College ## 3 No Yes 27 Some College ## 4 Yes No 53 Some College ## 5 No Yes 58 College Grad ``` ] --- class: middle, center ## Exploratory Data Analysis --- ### Exploratory Data Analysis (EDA) **Categorical predictors**: - Examine the percentage of `\(Y=1\)` for each level (category) - You can visualize using a stacked bar chart **Quantitative predictors**: - Order the predictor variable and divide into `\(g\)` groups of equal size - For each group, plot proportion of `\(Y=1\)` versus the average value of the predictor - See if there's a pattern (e.g. is there a quadratic pattern?) - Use `binnedplot` function --- ### EDA: Sleep trouble vs. Age .small[ ```r # create 1/0 indicator for response variable to use binnedplot nhanes <- nhanes %>% mutate(sleep_tr_ind = if_else(SleepTrouble == "Yes",1,0)) arm::binnedplot(x = nhanes$Age, y = nhanes$sleep_tr_ind, xlab = "Age", ylab = "proportion", main = "Sleep Trouble vs. Age", col.int = "white" #remove intervals ) ``` <img src="16-logistic-eda_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- ### Sleep trouble vs. Physical Activity ```r ggplot(data = nhanes, aes(x = PhysActive, fill = SleepTrouble)) + geom_bar(position = "fill") + labs(y = "proportion", title = "Sleep trouble vs. Physical activity") ``` <img src="16-logistic-eda_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ### Interaction? - Plot Sleep trouble vs. Age for each level of Physical Activity <img src="16-logistic-eda_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /><img src="16-logistic-eda_files/figure-html/unnamed-chunk-5-2.png" style="display: block; margin: auto;" /> --- class: middle, center ### Drop-in-deviance test --- ### Comparing Nested Models - Suppose there are two models: - Model 1 includes predictors `\(x_1, \ldots, x_q\)` - Model 2 includes predictors `\(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)` - We want to test the hypotheses `$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$` -- - We used a *Nested F Test* to compare two nested models in linear regression - We will use the <font class="vocab3">drop-in-deviance test</font> in logistic regression --- ### Deviance residual - The **deviance residual** is the a measure of how much the observed data differs from what is measured using the likelihood ratio - The deviance residual for the `\(i^{th}\)` observation is .alert[ `$$d_i = \text{sign}(Y_i - \hat{p}_i)\sqrt{2\bigg[Y_i \log\Big(\frac{Y_i}{p_i}\Big) + (1-Y_i)\log\Big(\frac{1-Y_i}{1-\hat{p}_i}\Big)\bigg]}$$` where `\(\text{sign}(Y_i - \hat{p}_i)\)` is positive when `\(Y_1 = 1\)` and negative when `\(Y_i = 0\)`. ] --- ### Drop-in-Deviance Test - The **deviance statistic** for Model `\(k\)` is `\(D_k = \sum\limits_{i=1}^n d_i^2\)` -- - To test the hypotheses `$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$` -- the <font class="vocab"> drop-in-deviance statistic </font> is `\(D_1 - D_2\)` -- - When the sample size is large, the drop-in-deviance statistic has an approximately Chi-squared distribution with degrees of freedom equal to the difference in number of predictor variables in Model 1 and Model 2 --- ### Should we add `Education` to the model? - Suppose - Model 1 includes `Age`, `PhysActive`, `Age*PhysActive` - Model 2 includes `Age`, `PhysActive`, `Age*PhysActive`, `Education` .small[ ```r model1 <- glm(SleepTrouble ~ Age + PhysActive + Age * PhysActive, data = nhanes, family = binomial) model2 <- glm(SleepTrouble ~ Age + PhysActive + Age * PhysActive + Education, data = nhanes, family = binomial) ``` ] -- ```r # Deviances (dev_model1 <- glance(model1)$deviance) ``` ``` ## [1] 553.2747 ``` ```r (dev_model2 <- glance(model2)$deviance) ``` ``` ## [1] 542.7319 ``` --- ### Should we add `Education` to the model? ```r # Drop-in-deviance test statistic (test_stat <- dev_model1 - dev_model2) ``` ``` ## [1] 10.54277 ``` -- ```r # p-value 1 - pchisq(test_stat, 4) ``` ``` ## [1] 0.03221292 ``` --- ### Should we add `Education` to the model? - We can use the **`anova`** function to conduct this test - Add **`test = "Chisq"`** to conduct the drop-in-deviance test ```r anova(model1, model2, test = "Chisq") ``` ``` ## Analysis of Deviance Table ## ## Model 1: SleepTrouble ~ Age + PhysActive + Age * PhysActive ## Model 2: SleepTrouble ~ Age + PhysActive + Age * PhysActive + Education ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 496 553.27 ## 2 492 542.73 4 10.543 0.03221 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Framingham Study - We will analyze data from a cardiovascular study on residents in Framingham, MA - **Goal**: Predict whether or not a participant has a 10-year risk of future coronary heart disease - Original data contains information from 4,000+ participants. We will use 500 for this analysis.