The goal of this exercise is to walk through a logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.

library(tidyverse)
library(broom)
library(pROC)
library(plotROC) 
library(knitr)

This data is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The goal is to predict whether a patient has a 10-year risk of future coronary heart disease. The dataset includes the following:

Load and prepare data

heart_data <- read_csv("https://raw.githubusercontent.com/sta210-sp20/datasets/master/framingham.csv") %>%
  drop_na() %>% #remove observations with missing values
  mutate(education = case_when(
    education == 1 ~ "Some HS", 
    education == 2 ~ "HS or GED", 
    education == 3 ~ "Some College", 
    education == 4 ~ "College"
  ),
  TenYearCHD = as.factor(TenYearCHD), 
  currentSmoker = as.factor(currentSmoker),
  ageCent = age - mean(age), 
  totCholCent = totChol - mean(totChol), 
  BPMeds = as.factor(BPMeds), 
  prevalentStroke = as.factor(prevalentStroke), 
  prevalentHyp = as.factor(prevalentHyp), 
  male = as.factor(male), 
  diabetes = as.factor(diabetes)
  )

Fit logistic model

risk_m <- glm(TenYearCHD ~ ageCent + currentSmoker + totCholCent, 
              data = heart_data, family = binomial)
tidy(risk_m, conf.int = TRUE, exponentiate = FALSE) %>% 
  kable(format = "markdown", digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2.111 0.077 -27.519 0.000 -2.264 -1.963
ageCent 0.081 0.006 13.477 0.000 0.070 0.093
currentSmoker1 0.447 0.099 4.537 0.000 0.255 0.641
totCholCent 0.003 0.001 2.339 0.019 0.000 0.005

Predictions

For a new patient

x0 <- data_frame(ageCent = (60 - 49.552), 
                 totCholCent = (263 - 236.848), 
currentSmoker = as.factor(0))
x0
## # A tibble: 1 x 3
##   ageCent totCholCent currentSmoker
##     <dbl>       <dbl> <fct>        
## 1    10.4        26.2 0

Predicted log-odds

predict(risk_m, x0) 
##         1 
## -1.192775

Predicted probabilities

predict(risk_m, x0, type = "response") 
##         1 
## 0.2327631

Based on this probability, would you consider this patient as being high risk for getting coronary heart disease in the next 10 years? Why or why not?

For all observations using augment

risk_m_aug <- augment(risk_m, type.predict = "response", 
                      type.residuals = "response")
risk_m_aug
## # A tibble: 3,658 x 11
##    TenYearCHD ageCent currentSmoker totCholCent .fitted .se.fit  .resid    .hat
##    <fct>        <dbl> <fct>               <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 0           -10.6  0                  -41.8   0.0441 0.00532 -0.0441 6.72e-4
##  2 0            -3.55 0                   13.2   0.0857 0.00698 -0.0857 6.21e-4
##  3 0            -1.55 1                    8.15  0.146  0.00871 -0.146  6.09e-4
##  4 1            11.4  1                  -11.8   0.319  0.0202   0.681  1.88e-3
##  5 0            -3.55 1                   48.2   0.138  0.0108  -0.138  9.80e-4
##  6 0            -6.55 0                   -8.85  0.0649 0.00614 -0.0649 6.21e-4
##  7 1            13.4  0                  -31.8   0.251  0.0184   0.749  1.79e-3
##  8 0            -4.55 1                   76.2   0.137  0.0134  -0.137  1.53e-3
##  9 0             2.45 0                   23.2   0.136  0.00852 -0.136  6.19e-4
## 10 0            -6.55 1                  -11.8   0.0973 0.00733 -0.0973 6.11e-4
## # … with 3,648 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>

Confusion Matrix

threshold <- 0.1
risk_m_aug %>%
  mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
  group_by(TenYearCHD, risk_predict) %>%
  summarise(n = n()) %>%
  kable(format="markdown")
TenYearCHD risk_predict n
0 0: No 1265
0 1: Yes 1836
1 0: No 85
1 1: Yes 472
risk_m_aug %>%
  mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
  select(TenYearCHD, risk_predict)
## # A tibble: 3,658 x 2
##    TenYearCHD risk_predict
##    <fct>      <chr>       
##  1 0          0: No       
##  2 0          0: No       
##  3 0          1: Yes      
##  4 1          1: Yes      
##  5 0          1: Yes      
##  6 0          0: No       
##  7 1          1: Yes      
##  8 0          1: Yes      
##  9 0          1: Yes      
## 10 0          0: No       
## # … with 3,648 more rows

What proportion of observations were misclassified?

What is the disadvantage of relying on the confusion matrix to assess the accuracy of the model?

ROC Curve

(roc_curve <- ggplot(risk_m_aug, 
                     aes(d = as.numeric(TenYearCHD) - 1, 
                         m = .fitted)) +
  geom_roc(n.cuts = 10, labelround = 3) + 
  geom_abline(intercept = 0) + 
  labs(x = "False Positive Rate (1 - Specificity)", 
       y = "True Positive Rate (Sensitivity)") )

calc_auc(roc_curve)$AUC
## [1] 0.6972743

A doctor plans to use the results from your model to help select patients for a new heart disease prevention program. She asks you which threshold would be best to select patients for this program. Based on the ROC curve from the previous slide, which threshold would you recommend to the doctor? Why?

Assumptions

Why we don’t plot raw residuals

ggplot(data = risk_m_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted Values", y = "Raw Residuals")

Binned residual plots

arm::binnedplot(x = risk_m_aug$.fitted, y = risk_m_aug$.resid,
                xlab = "Predicted Probabilities", 
                main = "Binned Residual vs. Predicted Values", 
                col.int = FALSE)

arm::binnedplot(x = risk_m_aug$ageCent, 
                y = risk_m_aug$.resid, 
                col.int = FALSE,
                xlab = "Age (Mean-Centered)", 
                main = "Binned Residual vs. Age")

arm::binnedplot(x = risk_m_aug$totCholCent, 
                y = risk_m_aug$.resid, 
                col.int = FALSE,
                xlab = "Total Cholesterol (Mean-Centered)", 
                main = "Binned Residual vs. Total Cholesterol")

risk_m_aug %>%
  group_by(currentSmoker) %>%
  summarise(mean_resid = mean(.resid))
## # A tibble: 2 x 2
##   currentSmoker mean_resid
##   <fct>              <dbl>
## 1 0              -2.95e-14
## 2 1              -2.42e-14

Check assumptions:

- Linearity? - Randomness? - Independence?

Inference for coefficients

How is the test statistic for currentSmoker1 calculated?

Is totCholCent a statistically significant predictor of the odds a person is high risk for coronary heart disease?

Justify your answer using the test statistic and p-value.

Justify your answer using the confidence interval.

Drop-in-Deviance Test

model_red <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol, 
              data = heart_data, family = binomial)
model_full <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol + 
               education, 
              data = heart_data, family = binomial)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2.733 0.307 -8.891 0.000 -3.338 -2.132
ageCent 0.078 0.006 12.701 0.000 0.066 0.091
currentSmoker1 0.449 0.099 4.540 0.000 0.255 0.643
totChol 0.003 0.001 2.503 0.012 0.001 0.005
educationHS or GED -0.150 0.173 -0.869 0.385 -0.485 0.194
educationSome College -0.203 0.192 -1.060 0.289 -0.578 0.175
educationSome HS 0.116 0.160 0.725 0.468 -0.192 0.436
anova(model_red, model_full, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: TenYearCHD ~ ageCent + currentSmoker + totChol
## Model 2: TenYearCHD ~ ageCent + currentSmoker + totChol + education
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1      3654     2895.0                       
## 2      3651     2887.2  3   7.7836   0.0507 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC

glance(model_red)$AIC
## [1] 2902.989
glance(model_full)$AIC
## [1] 2901.206

Based on the drop-in-deviance test, which model would you select?

Based on AIC, which model would you select?

Model Selection using step

full_model <- glm(TenYearCHD ~ ., data = heart_data, family = "binomial")
select_model <- step(full_model, direction = "backward")
## Start:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose + ageCent + totCholCent
## 
## 
## Step:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose + ageCent
## 
## 
## Step:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - education        3   2755.4 2785.4
## - diabetes         1   2752.2 2786.2
## - BMI              1   2752.3 2786.3
## - currentSmoker    1   2752.4 2786.4
## - diaBP            1   2752.6 2786.6
## - BPMeds           1   2752.7 2786.7
## - heartRate        1   2752.7 2786.7
## - prevalentStroke  1   2754.1 2788.1
## <none>                 2752.2 2788.2
## - prevalentHyp     1   2755.0 2789.0
## - totChol          1   2756.6 2790.6
## - cigsPerDay       1   2760.4 2794.4
## - glucose          1   2763.0 2797.0
## - sysBP            1   2768.6 2802.6
## - male             1   2776.0 2810.0
## - age              1   2839.6 2873.6
## 
## Step:  AIC=2785.42
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + 
##     diaBP + BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - diabetes         1   2755.4 2783.4
## - currentSmoker    1   2755.6 2783.6
## - BMI              1   2755.8 2783.8
## - BPMeds           1   2755.9 2783.9
## - diaBP            1   2755.9 2783.9
## - heartRate        1   2755.9 2783.9
## - prevalentStroke  1   2757.4 2785.4
## <none>                 2755.4 2785.4
## - prevalentHyp     1   2758.2 2786.2
## - totChol          1   2759.4 2787.4
## - cigsPerDay       1   2763.6 2791.6
## - glucose          1   2765.9 2793.9
## - sysBP            1   2772.6 2800.6
## - male             1   2781.4 2809.4
## - age              1   2852.8 2880.8
## 
## Step:  AIC=2783.44
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + totChol + sysBP + diaBP + 
##     BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - currentSmoker    1   2755.7 2781.7
## - BMI              1   2755.9 2781.9
## - BPMeds           1   2755.9 2781.9
## - diaBP            1   2756.0 2782.0
## - heartRate        1   2756.0 2782.0
## - prevalentStroke  1   2757.4 2783.4
## <none>                 2755.4 2783.4
## - prevalentHyp     1   2758.3 2784.3
## - totChol          1   2759.5 2785.5
## - cigsPerDay       1   2763.7 2789.7
## - sysBP            1   2772.6 2798.6
## - glucose          1   2774.3 2800.3
## - male             1   2781.5 2807.5
## - age              1   2853.0 2879.0
## 
## Step:  AIC=2781.65
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + BMI + heartRate + 
##     glucose
## 
##                   Df Deviance    AIC
## - BMI              1   2756.0 2780.0
## - BPMeds           1   2756.1 2780.1
## - heartRate        1   2756.2 2780.2
## - diaBP            1   2756.2 2780.2
## - prevalentStroke  1   2757.6 2781.6
## <none>                 2755.7 2781.7
## - prevalentHyp     1   2758.5 2782.5
## - totChol          1   2759.7 2783.7
## - sysBP            1   2772.9 2796.9
## - glucose          1   2774.4 2798.4
## - cigsPerDay       1   2777.8 2801.8
## - male             1   2781.7 2805.7
## - age              1   2853.0 2877.0
## 
## Step:  AIC=2780.01
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - diaBP            1   2756.4 2778.4
## - BPMeds           1   2756.5 2778.5
## - heartRate        1   2756.5 2778.5
## <none>                 2756.0 2780.0
## - prevalentStroke  1   2758.0 2780.0
## - prevalentHyp     1   2759.0 2781.0
## - totChol          1   2760.1 2782.1
## - sysBP            1   2773.2 2795.2
## - glucose          1   2775.2 2797.2
## - cigsPerDay       1   2777.8 2799.8
## - male             1   2782.3 2804.3
## - age              1   2853.3 2875.3
## 
## Step:  AIC=2778.42
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - BPMeds           1   2756.9 2776.9
## - heartRate        1   2757.0 2777.0
## - prevalentStroke  1   2758.4 2778.4
## <none>                 2756.4 2778.4
## - prevalentHyp     1   2759.1 2779.1
## - totChol          1   2760.5 2780.5
## - glucose          1   2776.0 2796.0
## - cigsPerDay       1   2778.5 2798.5
## - sysBP            1   2780.3 2800.3
## - male             1   2782.3 2802.3
## - age              1   2861.3 2881.3
## 
## Step:  AIC=2776.91
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - heartRate        1   2757.5 2775.5
## <none>                 2756.9 2776.9
## - prevalentStroke  1   2759.1 2777.1
## - prevalentHyp     1   2759.8 2777.8
## - totChol          1   2761.1 2779.1
## - glucose          1   2776.6 2794.6
## - cigsPerDay       1   2779.0 2797.0
## - sysBP            1   2782.2 2800.2
## - male             1   2782.5 2800.5
## - age              1   2862.0 2880.0
## 
## Step:  AIC=2775.5
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + glucose
## 
##                   Df Deviance    AIC
## <none>                 2757.5 2775.5
## - prevalentStroke  1   2759.8 2775.8
## - prevalentHyp     1   2760.3 2776.3
## - totChol          1   2761.5 2777.5
## - glucose          1   2776.7 2792.7
## - cigsPerDay       1   2779.0 2795.0
## - sysBP            1   2782.3 2798.3
## - male             1   2784.4 2800.4
## - age              1   2864.5 2880.5
tidy(select_model, conf.int = TRUE, exponentiate = FALSE) %>%
  kable(format = "markdown", digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -8.746 0.522 -16.740 0.000 -9.782 -7.734
male1 0.553 0.107 5.170 0.000 0.344 0.764
age 0.065 0.006 10.153 0.000 0.053 0.078
cigsPerDay 0.020 0.004 4.683 0.000 0.011 0.028
prevalentStroke1 0.752 0.484 1.554 0.120 -0.234 1.687
prevalentHyp1 0.226 0.135 1.671 0.095 -0.040 0.490
totChol 0.002 0.001 2.011 0.044 0.000 0.004
sysBP 0.014 0.003 4.976 0.000 0.009 0.020
glucose 0.007 0.002 4.374 0.000 0.004 0.011

References

Data obtained from https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset