The goal of this exercise is to walk through a logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive.
library(tidyverse)
library(broom)
library(pROC)
library(plotROC)
library(knitr)
This data is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The goal is to predict whether a patient has a 10-year risk of future coronary heart disease. The dataset includes the following:
male
: 0 = Female; 1 = Maleage
: Age at exam time.education
: 1 = Some High School; 2 = High School or GED; 3 = Some College or Vocational School; 4 = CollegecurrentSmoker
: 0 = nonsmoker; 1 = smokercigsPerDay
: number of cigarettes smoked per day (estimated average)BPMeds
: 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medicationsprevalentStroke
: 0 = Stroke not prevalent in family history; 1 = Stroke prevalent in family historyprevalentHyp
: 0 = Hypertension not prevalent in family history; 1 = Hypertension prevalent in family historydiabetes
: 0 = No; 1 = YestotChol
: total cholesterol (mg/dL)sysBP
: systolic blood pressure (mmHg)diaBP
: diastolic blood pressure (mmHg)BMI
: BodyMass Index calculated as: Weight (kg) / Height(meter-squared)heartRate
Beats/Min (Ventricular)glucose
: total glucose mg/dLTenYearCHD
: 0 = Patient doesn’t have 10-year risk of future coronary heart disease; 1 = Patient has 10-year risk of future coronary heart diseaseheart_data <- read_csv("https://raw.githubusercontent.com/sta210-sp20/datasets/master/framingham.csv") %>%
drop_na() %>% #remove observations with missing values
mutate(education = case_when(
education == 1 ~ "Some HS",
education == 2 ~ "HS or GED",
education == 3 ~ "Some College",
education == 4 ~ "College"
),
TenYearCHD = as.factor(TenYearCHD),
currentSmoker = as.factor(currentSmoker),
ageCent = age - mean(age),
totCholCent = totChol - mean(totChol),
BPMeds = as.factor(BPMeds),
prevalentStroke = as.factor(prevalentStroke),
prevalentHyp = as.factor(prevalentHyp),
male = as.factor(male),
diabetes = as.factor(diabetes)
)
risk_m <- glm(TenYearCHD ~ ageCent + currentSmoker + totCholCent,
data = heart_data, family = binomial)
tidy(risk_m, conf.int = TRUE, exponentiate = FALSE) %>%
kable(format = "markdown", digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -2.111 | 0.077 | -27.519 | 0.000 | -2.264 | -1.963 |
ageCent | 0.081 | 0.006 | 13.477 | 0.000 | 0.070 | 0.093 |
currentSmoker1 | 0.447 | 0.099 | 4.537 | 0.000 | 0.255 | 0.641 |
totCholCent | 0.003 | 0.001 | 2.339 | 0.019 | 0.000 | 0.005 |
x0 <- data_frame(ageCent = (60 - 49.552),
totCholCent = (263 - 236.848),
currentSmoker = as.factor(0))
x0
## # A tibble: 1 x 3
## ageCent totCholCent currentSmoker
## <dbl> <dbl> <fct>
## 1 10.4 26.2 0
predict(risk_m, x0)
## 1
## -1.192775
predict(risk_m, x0, type = "response")
## 1
## 0.2327631
Based on this probability, would you consider this patient as being high risk for getting coronary heart disease in the next 10 years? Why or why not?
augment
risk_m_aug <- augment(risk_m, type.predict = "response",
type.residuals = "response")
risk_m_aug
## # A tibble: 3,658 x 11
## TenYearCHD ageCent currentSmoker totCholCent .fitted .se.fit .resid .hat
## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 -10.6 0 -41.8 0.0441 0.00532 -0.0441 6.72e-4
## 2 0 -3.55 0 13.2 0.0857 0.00698 -0.0857 6.21e-4
## 3 0 -1.55 1 8.15 0.146 0.00871 -0.146 6.09e-4
## 4 1 11.4 1 -11.8 0.319 0.0202 0.681 1.88e-3
## 5 0 -3.55 1 48.2 0.138 0.0108 -0.138 9.80e-4
## 6 0 -6.55 0 -8.85 0.0649 0.00614 -0.0649 6.21e-4
## 7 1 13.4 0 -31.8 0.251 0.0184 0.749 1.79e-3
## 8 0 -4.55 1 76.2 0.137 0.0134 -0.137 1.53e-3
## 9 0 2.45 0 23.2 0.136 0.00852 -0.136 6.19e-4
## 10 0 -6.55 1 -11.8 0.0973 0.00733 -0.0973 6.11e-4
## # … with 3,648 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
threshold <- 0.1
risk_m_aug %>%
mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
group_by(TenYearCHD, risk_predict) %>%
summarise(n = n()) %>%
kable(format="markdown")
TenYearCHD | risk_predict | n |
---|---|---|
0 | 0: No | 1265 |
0 | 1: Yes | 1836 |
1 | 0: No | 85 |
1 | 1: Yes | 472 |
risk_m_aug %>%
mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>%
select(TenYearCHD, risk_predict)
## # A tibble: 3,658 x 2
## TenYearCHD risk_predict
## <fct> <chr>
## 1 0 0: No
## 2 0 0: No
## 3 0 1: Yes
## 4 1 1: Yes
## 5 0 1: Yes
## 6 0 0: No
## 7 1 1: Yes
## 8 0 1: Yes
## 9 0 1: Yes
## 10 0 0: No
## # … with 3,648 more rows
What proportion of observations were misclassified?
What is the disadvantage of relying on the confusion matrix to assess the accuracy of the model?
(roc_curve <- ggplot(risk_m_aug,
aes(d = as.numeric(TenYearCHD) - 1,
m = .fitted)) +
geom_roc(n.cuts = 10, labelround = 3) +
geom_abline(intercept = 0) +
labs(x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") )
calc_auc(roc_curve)$AUC
## [1] 0.6972743
A doctor plans to use the results from your model to help select patients for a new heart disease prevention program. She asks you which threshold would be best to select patients for this program. Based on the ROC curve from the previous slide, which threshold would you recommend to the doctor? Why?
ggplot(data = risk_m_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Predicted Values", y = "Raw Residuals")
arm::binnedplot(x = risk_m_aug$.fitted, y = risk_m_aug$.resid,
xlab = "Predicted Probabilities",
main = "Binned Residual vs. Predicted Values",
col.int = FALSE)
arm::binnedplot(x = risk_m_aug$ageCent,
y = risk_m_aug$.resid,
col.int = FALSE,
xlab = "Age (Mean-Centered)",
main = "Binned Residual vs. Age")
arm::binnedplot(x = risk_m_aug$totCholCent,
y = risk_m_aug$.resid,
col.int = FALSE,
xlab = "Total Cholesterol (Mean-Centered)",
main = "Binned Residual vs. Total Cholesterol")
risk_m_aug %>%
group_by(currentSmoker) %>%
summarise(mean_resid = mean(.resid))
## # A tibble: 2 x 2
## currentSmoker mean_resid
## <fct> <dbl>
## 1 0 -2.95e-14
## 2 1 -2.42e-14
- Linearity? - Randomness? - Independence?
How is the test statistic for currentSmoker1
calculated?
Is totCholCent
a statistically significant predictor of the odds a person is high risk for coronary heart disease?
Justify your answer using the test statistic and p-value.
Justify your answer using the confidence interval.
model_red <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol,
data = heart_data, family = binomial)
model_full <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol +
education,
data = heart_data, family = binomial)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -2.733 | 0.307 | -8.891 | 0.000 | -3.338 | -2.132 |
ageCent | 0.078 | 0.006 | 12.701 | 0.000 | 0.066 | 0.091 |
currentSmoker1 | 0.449 | 0.099 | 4.540 | 0.000 | 0.255 | 0.643 |
totChol | 0.003 | 0.001 | 2.503 | 0.012 | 0.001 | 0.005 |
educationHS or GED | -0.150 | 0.173 | -0.869 | 0.385 | -0.485 | 0.194 |
educationSome College | -0.203 | 0.192 | -1.060 | 0.289 | -0.578 | 0.175 |
educationSome HS | 0.116 | 0.160 | 0.725 | 0.468 | -0.192 | 0.436 |
anova(model_red, model_full, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: TenYearCHD ~ ageCent + currentSmoker + totChol
## Model 2: TenYearCHD ~ ageCent + currentSmoker + totChol + education
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3654 2895.0
## 2 3651 2887.2 3 7.7836 0.0507 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(model_red)$AIC
## [1] 2902.989
glance(model_full)$AIC
## [1] 2901.206
Based on the drop-in-deviance test, which model would you select?
Based on AIC, which model would you select?
step
full_model <- glm(TenYearCHD ~ ., data = heart_data, family = "binomial")
select_model <- step(full_model, direction = "backward")
## Start: AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay +
## BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol +
## sysBP + diaBP + BMI + heartRate + glucose + ageCent + totCholCent
##
##
## Step: AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay +
## BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol +
## sysBP + diaBP + BMI + heartRate + glucose + ageCent
##
##
## Step: AIC=2788.17
## TenYearCHD ~ male + age + education + currentSmoker + cigsPerDay +
## BPMeds + prevalentStroke + prevalentHyp + diabetes + totChol +
## sysBP + diaBP + BMI + heartRate + glucose
##
## Df Deviance AIC
## - education 3 2755.4 2785.4
## - diabetes 1 2752.2 2786.2
## - BMI 1 2752.3 2786.3
## - currentSmoker 1 2752.4 2786.4
## - diaBP 1 2752.6 2786.6
## - BPMeds 1 2752.7 2786.7
## - heartRate 1 2752.7 2786.7
## - prevalentStroke 1 2754.1 2788.1
## <none> 2752.2 2788.2
## - prevalentHyp 1 2755.0 2789.0
## - totChol 1 2756.6 2790.6
## - cigsPerDay 1 2760.4 2794.4
## - glucose 1 2763.0 2797.0
## - sysBP 1 2768.6 2802.6
## - male 1 2776.0 2810.0
## - age 1 2839.6 2873.6
##
## Step: AIC=2785.42
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds +
## prevalentStroke + prevalentHyp + diabetes + totChol + sysBP +
## diaBP + BMI + heartRate + glucose
##
## Df Deviance AIC
## - diabetes 1 2755.4 2783.4
## - currentSmoker 1 2755.6 2783.6
## - BMI 1 2755.8 2783.8
## - BPMeds 1 2755.9 2783.9
## - diaBP 1 2755.9 2783.9
## - heartRate 1 2755.9 2783.9
## - prevalentStroke 1 2757.4 2785.4
## <none> 2755.4 2785.4
## - prevalentHyp 1 2758.2 2786.2
## - totChol 1 2759.4 2787.4
## - cigsPerDay 1 2763.6 2791.6
## - glucose 1 2765.9 2793.9
## - sysBP 1 2772.6 2800.6
## - male 1 2781.4 2809.4
## - age 1 2852.8 2880.8
##
## Step: AIC=2783.44
## TenYearCHD ~ male + age + currentSmoker + cigsPerDay + BPMeds +
## prevalentStroke + prevalentHyp + totChol + sysBP + diaBP +
## BMI + heartRate + glucose
##
## Df Deviance AIC
## - currentSmoker 1 2755.7 2781.7
## - BMI 1 2755.9 2781.9
## - BPMeds 1 2755.9 2781.9
## - diaBP 1 2756.0 2782.0
## - heartRate 1 2756.0 2782.0
## - prevalentStroke 1 2757.4 2783.4
## <none> 2755.4 2783.4
## - prevalentHyp 1 2758.3 2784.3
## - totChol 1 2759.5 2785.5
## - cigsPerDay 1 2763.7 2789.7
## - sysBP 1 2772.6 2798.6
## - glucose 1 2774.3 2800.3
## - male 1 2781.5 2807.5
## - age 1 2853.0 2879.0
##
## Step: AIC=2781.65
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + totChol + sysBP + diaBP + BMI + heartRate +
## glucose
##
## Df Deviance AIC
## - BMI 1 2756.0 2780.0
## - BPMeds 1 2756.1 2780.1
## - heartRate 1 2756.2 2780.2
## - diaBP 1 2756.2 2780.2
## - prevalentStroke 1 2757.6 2781.6
## <none> 2755.7 2781.7
## - prevalentHyp 1 2758.5 2782.5
## - totChol 1 2759.7 2783.7
## - sysBP 1 2772.9 2796.9
## - glucose 1 2774.4 2798.4
## - cigsPerDay 1 2777.8 2801.8
## - male 1 2781.7 2805.7
## - age 1 2853.0 2877.0
##
## Step: AIC=2780.01
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + totChol + sysBP + diaBP + heartRate + glucose
##
## Df Deviance AIC
## - diaBP 1 2756.4 2778.4
## - BPMeds 1 2756.5 2778.5
## - heartRate 1 2756.5 2778.5
## <none> 2756.0 2780.0
## - prevalentStroke 1 2758.0 2780.0
## - prevalentHyp 1 2759.0 2781.0
## - totChol 1 2760.1 2782.1
## - sysBP 1 2773.2 2795.2
## - glucose 1 2775.2 2797.2
## - cigsPerDay 1 2777.8 2799.8
## - male 1 2782.3 2804.3
## - age 1 2853.3 2875.3
##
## Step: AIC=2778.42
## TenYearCHD ~ male + age + cigsPerDay + BPMeds + prevalentStroke +
## prevalentHyp + totChol + sysBP + heartRate + glucose
##
## Df Deviance AIC
## - BPMeds 1 2756.9 2776.9
## - heartRate 1 2757.0 2777.0
## - prevalentStroke 1 2758.4 2778.4
## <none> 2756.4 2778.4
## - prevalentHyp 1 2759.1 2779.1
## - totChol 1 2760.5 2780.5
## - glucose 1 2776.0 2796.0
## - cigsPerDay 1 2778.5 2798.5
## - sysBP 1 2780.3 2800.3
## - male 1 2782.3 2802.3
## - age 1 2861.3 2881.3
##
## Step: AIC=2776.91
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp +
## totChol + sysBP + heartRate + glucose
##
## Df Deviance AIC
## - heartRate 1 2757.5 2775.5
## <none> 2756.9 2776.9
## - prevalentStroke 1 2759.1 2777.1
## - prevalentHyp 1 2759.8 2777.8
## - totChol 1 2761.1 2779.1
## - glucose 1 2776.6 2794.6
## - cigsPerDay 1 2779.0 2797.0
## - sysBP 1 2782.2 2800.2
## - male 1 2782.5 2800.5
## - age 1 2862.0 2880.0
##
## Step: AIC=2775.5
## TenYearCHD ~ male + age + cigsPerDay + prevalentStroke + prevalentHyp +
## totChol + sysBP + glucose
##
## Df Deviance AIC
## <none> 2757.5 2775.5
## - prevalentStroke 1 2759.8 2775.8
## - prevalentHyp 1 2760.3 2776.3
## - totChol 1 2761.5 2777.5
## - glucose 1 2776.7 2792.7
## - cigsPerDay 1 2779.0 2795.0
## - sysBP 1 2782.3 2798.3
## - male 1 2784.4 2800.4
## - age 1 2864.5 2880.5
tidy(select_model, conf.int = TRUE, exponentiate = FALSE) %>%
kable(format = "markdown", digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -8.746 | 0.522 | -16.740 | 0.000 | -9.782 | -7.734 |
male1 | 0.553 | 0.107 | 5.170 | 0.000 | 0.344 | 0.764 |
age | 0.065 | 0.006 | 10.153 | 0.000 | 0.053 | 0.078 |
cigsPerDay | 0.020 | 0.004 | 4.683 | 0.000 | 0.011 | 0.028 |
prevalentStroke1 | 0.752 | 0.484 | 1.554 | 0.120 | -0.234 | 1.687 |
prevalentHyp1 | 0.226 | 0.135 | 1.671 | 0.095 | -0.040 | 0.490 |
totChol | 0.002 | 0.001 | 2.011 | 0.044 | 0.000 | 0.004 |
sysBP | 0.014 | 0.003 | 4.976 | 0.000 | 0.009 | 0.020 |
glucose | 0.007 | 0.002 | 4.374 | 0.000 | 0.004 | 0.011 |
Data obtained from https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset