Logistic regression

# Logistic regression
## Exploratory data analysis & Drop-in-Deviance Test
### Dr. Maria Tackett
### 03.20.19

---

### Announcements

- HW 05 due Mon, March 25 at 11:59p

- Project proposal due Fri, March 29 at 11:59p

---

### Packages

```r
library(tidyverse)
library(knitr)
library(broom)
library(NHANES)
library(pROC) #ROC curves
```

---

### NHANES Data

- [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) is conducted by the National Center for Health Statistics (NCHS)

- The goal is to *"assess the health and nutritional status of adults and children in the United States"*

- This survey includes an interview and a physical examination

---

### NHANES: Physical Activity & Sleep

- Does physical activity reduce the odds of having sleep trouble?

- We will analyze the following variables:

+ `SleepTrouble`: Participant has told doctor or other health professional they had trouble sleeping
 
 + `PhysActive`: Participant does moderate to vigorous-intensity sports, fitness or recreational activities
 
 + `Age`: Age at time of screening (in years). Participants 80 or older were recorded as 80.
 
 + `Education`: 8th Grade, 9-11th Grade, High School, Some College, College Grad
 
---

### NHANES: Physical Activity & Sleep

```
## # A tibble: 5 x 4
## SleepTrouble PhysActive Age Education 
## <fct> <fct> <int> <fct> 
## 1 No No 30 9 - 11th Grade
## 2 No Yes 20 Some College 
## 3 No Yes 27 Some College 
## 4 Yes No 53 Some College 
## 5 No Yes 58 College Grad
```
]

---

## Exploratory Data Analysis

---

### Exploratory Data Analysis (EDA)

**Categorical predictors**: 
- Examine the percentage of `$Y=1$` for each level (category)
- You can visualize using a stacked bar chart

**Quantitative predictors**:
- Order the predictor variable and divide into `$g$` groups of equal size
- For each group, plot proportion of `$Y=1$` versus the average value of the predictor
- See if there's a pattern (e.g. is there a quadratic pattern?)
- Use `binnedplot` function

---

### EDA: Sleep trouble vs. Age

```r
# create 1/0 indicator for response variable to use binnedplot
nhanes <- nhanes %>% 
 mutate(sleep_tr_ind = if_else(SleepTrouble == "Yes",1,0))

arm::binnedplot(x = nhanes$Age, y = nhanes$sleep_tr_ind, 
                xlab = "Age", ylab = "proportion",
                main = "Sleep Trouble vs. Age", 
                col.int = "white" #remove intervals
                )
```

<img src="16-logistic-eda_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---

### Sleep trouble vs. Physical Activity

```r
ggplot(data = nhanes, aes(x = PhysActive, fill = SleepTrouble)) +
  geom_bar(position = "fill") + 
  labs(y = "proportion", 
       title = "Sleep trouble vs. Physical activity")
```

---

### Interaction?

- Plot Sleep trouble vs. Age for each level of Physical Activity

---

### Drop-in-deviance test

---

### Comparing Nested Models

- Suppose there are two models: 
    - Model 1 includes predictors `$x_1, \ldots, x_q$`
    - Model 2 includes predictors `$x_1, \ldots, x_q, x_{q+1}, \ldots, x_p$`

- We want to test the hypotheses
`$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\
& H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$`

- We used a *Nested F Test* to compare two nested models in linear regression

- We will use the drop-in-deviance test in logistic regression

---

### Deviance residual

- The **deviance residual** is the a measure of how much the observed data differs from what is measured using the likelihood ratio

- The deviance residual for the `$i^{th}$` observation is

.alert[
`$$d_i = \text{sign}(Y_i - \hat{p}_i)\sqrt{2\bigg[Y_i \log\Big(\frac{Y_i}{p_i}\Big) + (1-Y_i)\log\Big(\frac{1-Y_i}{1-\hat{p}_i}\Big)\bigg]}$$`
where `$\text{sign}(Y_i - \hat{p}_i)$` is positive when `$Y_1 = 1$` and negative when `$Y_i = 0$`.
]

---

### Drop-in-Deviance Test

- The **deviance statistic** for Model `$k$` is `$D_k = \sum\limits_{i=1}^n d_i^2$`

- To test the hypotheses
`$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\
& H_a: \text{ at least 1 }\beta_j \text{ is not} 0\end{aligned}$$`

the drop-in-deviance statistic is `$D_1 - D_2$`

- When the sample size is large, the drop-in-deviance statistic has an approximately Chi-squared distribution with degrees of freedom equal to the difference in number of predictor variables in Model 1 and Model 2

---

### Should we add `Education` to the model?

- Suppose 
    - Model 1 includes `Age`, `PhysActive`, `Age*PhysActive`
    - Model 2 includes `Age`, `PhysActive`, `Age*PhysActive`, `Education`

```r
model1 <- glm(SleepTrouble ~ Age + PhysActive + Age * PhysActive, 
 data = nhanes, family = binomial)
model2 <- glm(SleepTrouble ~ Age + PhysActive + Age * PhysActive + 
 Education, data = nhanes, family = binomial)
```
]

```r
# Deviances
(dev_model1 <- glance(model1)$deviance)
```

```
## [1] 553.2747
```

```r
(dev_model2 <- glance(model2)$deviance)
```

```
## [1] 542.7319
```

---
### Should we add `Education` to the model?

```r
# Drop-in-deviance test statistic
(test_stat <- dev_model1 - dev_model2)
```

```
## [1] 10.54277
```

```r
# p-value
1 - pchisq(test_stat, 4)
```

```
## [1] 0.03221292
```

---

### Should we add `Education` to the model?

- We can use the **`anova`** function to conduct this test 
    - Add **`test = "Chisq"`** to conduct the drop-in-deviance test

```r
anova(model1, model2, test = "Chisq")
```

```
## Analysis of Deviance Table
## 
## Model 1: SleepTrouble ~ Age + PhysActive + Age * PhysActive
## Model 2: SleepTrouble ~ Age + PhysActive + Age * PhysActive + Education
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       496     553.27                       
## 2       492     542.73  4   10.543  0.03221 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## Framingham Study

- We will analyze data from a cardiovascular study on residents in Framingham, MA

- **Goal**: Predict whether or not a participant has a 10-year risk of future coronary heart disease

- Original data contains information from 4,000+ participants. We will use 500 for this analysis.