Logistic Regression

The goal of this exercise is to walk through a logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.

library(tidyverse)
library(broom)
library(pROC)
library(plotROC) 
library(knitr)

This data is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The goal is to predict whether a patient has a 10-year risk of future coronary heart disease. The dataset includes the following:

male: 0 = Female; 1 = Male
age: Age at exam time.
education: 1 = Some High School; 2 = High School or GED; 3 = Some College or Vocational School; 4 = College
currentSmoker: 0 = nonsmoker; 1 = smoker
cigsPerDay: number of cigarettes smoked per day (estimated average)
BPMeds: 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medications
prevalentStroke: 0 = Stroke not prevalent in family history; 1 = Stroke prevalent in family history
prevalentHyp: 0 = Hypertension not prevalent in family history; 1 = Hypertension prevalent in family history
diabetes: 0 = No; 1 = Yes
totChol: total cholesterol (mg/dL)
sysBP: systolic blood pressure (mmHg)
diaBP: diastolic blood pressure (mmHg)
BMI: BodyMass Index calculated as: Weight (kg) / Height(meter-squared)
heartRate Beats/Min (Ventricular)
glucose: total glucose mg/dL
TenYearCHD: 0 = Patient doesn’t have 10-year risk of future coronary heart disease; 1 = Patient has 10-year risk of future coronary heart disease

Load and prepare data

heart_data <- read_csv("https://raw.githubusercontent.com/sta210-sp20/datasets/master/framingham.csv") %>%
  drop_na() %>% #remove observations with missing values
  mutate(education = case_when(
    education == 1 ~ "Some HS", 
    education == 2 ~ "HS or GED", 
    education == 3 ~ "Some College", 
    education == 4 ~ "College"
  ),
  TenYearCHD = as.factor(TenYearCHD), 
  currentSmoker = as.factor(currentSmoker),
  ageCent = age - mean(age), 
  totCholCent = totChol - mean(totChol), 
  BPMeds = as.factor(BPMeds), 
  prevalentStroke = as.factor(prevalentStroke), 
  prevalentHyp = as.factor(prevalentHyp), 
  male = as.factor(male), 
  diabetes = as.factor(diabetes)
  )

Fit logistic model

risk_m <- glm(TenYearCHD ~ ageCent + currentSmoker + totCholCent, 
              data = heart_data, family = binomial)
tidy(risk_m, conf.int = TRUE, exponentiate = FALSE) %>% 
  kable(format = "markdown", digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2.111	0.077	-27.519	0.000	-2.264	-1.963
ageCent	0.081	0.006	13.477	0.000	0.070	0.093
currentSmoker1	0.447	0.099	4.537	0.000	0.255	0.641
totCholCent	0.003	0.001	2.339	0.019	0.000	0.005

Predictions

For a new patient

x0 <- data_frame(ageCent = (60 - 49.552), 
                 totCholCent = (263 - 236.848), 
currentSmoker = as.factor(0))

x0

## # A tibble: 1 x 3
##   ageCent totCholCent currentSmoker
##     <dbl>       <dbl> <fct>        
## 1    10.4        26.2 0

Predicted log-odds

predict(risk_m, x0)

##         1 
## -1.192775

Predicted probabilities

predict(risk_m, x0, type = "response")

##         1 
## 0.2327631

Based on this probability, would you consider this patient as being high risk for getting coronary heart disease in the next 10 years? Why or why not?

For all observations using `augment`

risk_m_aug <- augment(risk_m, type.predict = "response", 
                      type.residuals = "response")

risk_m_aug

## # A tibble: 3,658 x 11
##    TenYearCHD ageCent currentSmoker totCholCent .fitted .se.fit  .resid    .hat
##    <fct>        <dbl> <fct>               <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 0           -10.6  0                  -41.8   0.0441 0.00532 -0.0441 6.72e-4
##  2 0            -3.55 0                   13.2   0.0857 0.00698 -0.0857 6.21e-4
##  3 0            -1.55 1                    8.15  0.146  0.00871 -0.146  6.09e-4
##  4 1            11.4  1                  -11.8   0.319  0.0202   0.681  1.88e-3
##  5 0            -3.55 1                   48.2   0.138  0.0108  -0.138  9.80e-4
##  6 0            -6.55 0                   -8.85  0.0649 0.00614 -0.0649 6.21e-4
##  7 1            13.4  0                  -31.8   0.251  0.0184   0.749  1.79e-3
##  8 0            -4.55 1                   76.2   0.137  0.0134  -0.137  1.53e-3
##  9 0             2.45 0                   23.2   0.136  0.00852 -0.136  6.19e-4
## 10 0            -6.55 1                  -11.8   0.0973 0.00733 -0.0973 6.11e-4
## # … with 3,648 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>

Confusion Matrix

threshold <- 0.1

risk_m_aug %>%
  mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
  group_by(TenYearCHD, risk_predict) %>%
  summarise(n = n()) %>%
  kable(format="markdown")

TenYearCHD	risk_predict	n
0	0: No	1265
0	1: Yes	1836
1	0: No	85
1	1: Yes	472

risk_m_aug %>%
  mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
  select(TenYearCHD, risk_predict)

## # A tibble: 3,658 x 2
##    TenYearCHD risk_predict
##    <fct>      <chr>       
##  1 0          0: No       
##  2 0          0: No       
##  3 0          1: Yes      
##  4 1          1: Yes      
##  5 0          1: Yes      
##  6 0          0: No       
##  7 1          1: Yes      
##  8 0          1: Yes      
##  9 0          1: Yes      
## 10 0          0: No       
## # … with 3,648 more rows

What proportion of observations were misclassified?

What is the disadvantage of relying on the confusion matrix to assess the accuracy of the model?

ROC Curve

(roc_curve <- ggplot(risk_m_aug, 
                     aes(d = as.numeric(TenYearCHD) - 1, 
                         m = .fitted)) +
  geom_roc(n.cuts = 10, labelround = 3) + 
  geom_abline(intercept = 0) + 
  labs(x = "False Positive Rate (1 - Specificity)", 
       y = "True Positive Rate (Sensitivity)") )

calc_auc(roc_curve)$AUC

## [1] 0.6972743

A doctor plans to use the results from your model to help select patients for a new heart disease prevention program. She asks you which threshold would be best to select patients for this program. Based on the ROC curve from the previous slide, which threshold would you recommend to the doctor? Why?

Assumptions

Why we don’t plot raw residuals

ggplot(data = risk_m_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted Values", y = "Raw Residuals")

Binned residual plots

arm::binnedplot(x = risk_m_aug$.fitted, y = risk_m_aug$.resid,
                xlab = "Predicted Probabilities", 
                main = "Binned Residual vs. Predicted Values", 
                col.int = FALSE)

arm::binnedplot(x = risk_m_aug$ageCent, 
                y = risk_m_aug$.resid, 
                col.int = FALSE,
                xlab = "Age (Mean-Centered)", 
                main = "Binned Residual vs. Age")

arm::binnedplot(x = risk_m_aug$totCholCent, 
                y = risk_m_aug$.resid, 
                col.int = FALSE,
                xlab = "Total Cholesterol (Mean-Centered)", 
                main = "Binned Residual vs. Total Cholesterol")

risk_m_aug %>%
  group_by(currentSmoker) %>%
  summarise(mean_resid = mean(.resid))

## # A tibble: 2 x 2
##   currentSmoker mean_resid
##   <fct>              <dbl>
## 1 0              -2.95e-14
## 2 1              -2.42e-14

Check assumptions:

- Linearity? - Randomness? - Independence?

Inference for coefficients

How is the test statistic for currentSmoker1 calculated?

Is totCholCent a statistically significant predictor of the odds a person is high risk for coronary heart disease?

Justify your answer using the test statistic and p-value.

Justify your answer using the confidence interval.

Drop-in-Deviance Test

model_red <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol, 
              data = heart_data, family = binomial)
model_full <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol + 
               education, 
              data = heart_data, family = binomial)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2.733	0.307	-8.891	0.000	-3.338	-2.132
ageCent	0.078	0.006	12.701	0.000	0.066	0.091
currentSmoker1	0.449	0.099	4.540	0.000	0.255	0.643
totChol	0.003	0.001	2.503	0.012	0.001	0.005
educationHS or GED	-0.150	0.173	-0.869	0.385	-0.485	0.194
educationSome College	-0.203	0.192	-1.060	0.289	-0.578	0.175
educationSome HS	0.116	0.160	0.725	0.468	-0.192	0.436

anova(model_red, model_full, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: TenYearCHD ~ ageCent + currentSmoker + totChol
## Model 2: TenYearCHD ~ ageCent + currentSmoker + totChol + education
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1      3654     2895.0                       
## 2      3651     2887.2  3   7.7836   0.0507 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC

glance(model_red)$AIC

## [1] 2902.989

glance(model_full)$AIC

## [1] 2901.206

Based on the drop-in-deviance test, which model would you select?

Based on AIC, which model would you select?

Model Selection using `step`

full_model <- glm(TenYearCHD ~ ., data = heart_data, family = "binomial")

select_model <- step(full_model, direction = "backward")

## Start:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose + ageCent + totCholCent
## 
## 
## Step:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose + ageCent
## 
## 
## Step:  AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay + 
##     BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol + 
##     sysBP + diaBP + BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - education        3   2755.4 2785.4
## - diabetes         1   2752.2 2786.2
## - BMI              1   2752.3 2786.3
## - currentSmoker    1   2752.4 2786.4
## - diaBP            1   2752.6 2786.6
## - BPMeds           1   2752.7 2786.7
## - heartRate        1   2752.7 2786.7
## - prevalentStroke  1   2754.1 2788.1
## <none>                 2752.2 2788.2
## - prevalentHyp     1   2755.0 2789.0
## - totChol          1   2756.6 2790.6
## - cigsPerDay       1   2760.4 2794.4
## - glucose          1   2763.0 2797.0
## - sysBP            1   2768.6 2802.6
## - male             1   2776.0 2810.0
## - age              1   2839.6 2873.6
## 
## Step:  AIC=2785.42
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + diabetes + totChol + sysBP + 
##     diaBP + BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - diabetes         1   2755.4 2783.4
## - currentSmoker    1   2755.6 2783.6
## - BMI              1   2755.8 2783.8
## - BPMeds           1   2755.9 2783.9
## - diaBP            1   2755.9 2783.9
## - heartRate        1   2755.9 2783.9
## - prevalentStroke  1   2757.4 2785.4
## <none>                 2755.4 2785.4
## - prevalentHyp     1   2758.2 2786.2
## - totChol          1   2759.4 2787.4
## - cigsPerDay       1   2763.6 2791.6
## - glucose          1   2765.9 2793.9
## - sysBP            1   2772.6 2800.6
## - male             1   2781.4 2809.4
## - age              1   2852.8 2880.8
## 
## Step:  AIC=2783.44
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds + 
##     prevalentStroke + prevalentHyp + totChol + sysBP + diaBP + 
##     BMI + heartRate + glucose
## 
##                   Df Deviance    AIC
## - currentSmoker    1   2755.7 2781.7
## - BMI              1   2755.9 2781.9
## - BPMeds           1   2755.9 2781.9
## - diaBP            1   2756.0 2782.0
## - heartRate        1   2756.0 2782.0
## - prevalentStroke  1   2757.4 2783.4
## <none>                 2755.4 2783.4
## - prevalentHyp     1   2758.3 2784.3
## - totChol          1   2759.5 2785.5
## - cigsPerDay       1   2763.7 2789.7
## - sysBP            1   2772.6 2798.6
## - glucose          1   2774.3 2800.3
## - male             1   2781.5 2807.5
## - age              1   2853.0 2879.0
## 
## Step:  AIC=2781.65
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + BMI + heartRate + 
##     glucose
## 
##                   Df Deviance    AIC
## - BMI              1   2756.0 2780.0
## - BPMeds           1   2756.1 2780.1
## - heartRate        1   2756.2 2780.2
## - diaBP            1   2756.2 2780.2
## - prevalentStroke  1   2757.6 2781.6
## <none>                 2755.7 2781.7
## - prevalentHyp     1   2758.5 2782.5
## - totChol          1   2759.7 2783.7
## - sysBP            1   2772.9 2796.9
## - glucose          1   2774.4 2798.4
## - cigsPerDay       1   2777.8 2801.8
## - male             1   2781.7 2805.7
## - age              1   2853.0 2877.0
## 
## Step:  AIC=2780.01
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + diaBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - diaBP            1   2756.4 2778.4
## - BPMeds           1   2756.5 2778.5
## - heartRate        1   2756.5 2778.5
## <none>                 2756.0 2780.0
## - prevalentStroke  1   2758.0 2780.0
## - prevalentHyp     1   2759.0 2781.0
## - totChol          1   2760.1 2782.1
## - sysBP            1   2773.2 2795.2
## - glucose          1   2775.2 2797.2
## - cigsPerDay       1   2777.8 2799.8
## - male             1   2782.3 2804.3
## - age              1   2853.3 2875.3
## 
## Step:  AIC=2778.42
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke + 
##     prevalentHyp + totChol + sysBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - BPMeds           1   2756.9 2776.9
## - heartRate        1   2757.0 2777.0
## - prevalentStroke  1   2758.4 2778.4
## <none>                 2756.4 2778.4
## - prevalentHyp     1   2759.1 2779.1
## - totChol          1   2760.5 2780.5
## - glucose          1   2776.0 2796.0
## - cigsPerDay       1   2778.5 2798.5
## - sysBP            1   2780.3 2800.3
## - male             1   2782.3 2802.3
## - age              1   2861.3 2881.3
## 
## Step:  AIC=2776.91
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + heartRate + glucose
## 
##                   Df Deviance    AIC
## - heartRate        1   2757.5 2775.5
## <none>                 2756.9 2776.9
## - prevalentStroke  1   2759.1 2777.1
## - prevalentHyp     1   2759.8 2777.8
## - totChol          1   2761.1 2779.1
## - glucose          1   2776.6 2794.6
## - cigsPerDay       1   2779.0 2797.0
## - sysBP            1   2782.2 2800.2
## - male             1   2782.5 2800.5
## - age              1   2862.0 2880.0
## 
## Step:  AIC=2775.5
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp + 
##     totChol + sysBP + glucose
## 
##                   Df Deviance    AIC
## <none>                 2757.5 2775.5
## - prevalentStroke  1   2759.8 2775.8
## - prevalentHyp     1   2760.3 2776.3
## - totChol          1   2761.5 2777.5
## - glucose          1   2776.7 2792.7
## - cigsPerDay       1   2779.0 2795.0
## - sysBP            1   2782.3 2798.3
## - male             1   2784.4 2800.4
## - age              1   2864.5 2880.5

tidy(select_model, conf.int = TRUE, exponentiate = FALSE) %>%
  kable(format = "markdown", digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-8.746	0.522	-16.740	0.000	-9.782	-7.734
male1	0.553	0.107	5.170	0.000	0.344	0.764
age	0.065	0.006	10.153	0.000	0.053	0.078
cigsPerDay	0.020	0.004	4.683	0.000	0.011	0.028
prevalentStroke1	0.752	0.484	1.554	0.120	-0.234	1.687
prevalentHyp1	0.226	0.135	1.671	0.095	-0.040	0.490
totChol	0.002	0.001	2.011	0.044	0.000	0.004
sysBP	0.014	0.003	4.976	0.000	0.009	0.020
glucose	0.007	0.002	4.374	0.000	0.004	0.011

References

Data obtained from https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset