--- title: "Logistic Regression" date: "`r Sys.Date()`" output: html_document: theme: readable --- ```{r global-options, include=FALSE} knitr::opts_chunk$set(warning=FALSE, message=FALSE) ``` The goal of this exercise is to walk through a logistic regression analysis. It will give you a basic idea of the analysis steps and thought-process; however, due to class time constraints, this analysis is not exhaustive. ```{r load-packages} library(tidyverse) library(broom) library(pROC) library(plotROC) library(knitr) ``` This data is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The goal is to predict whether a patient has a 10-year risk of future coronary heart disease. The dataset includes the following: - `male`: 0 = Female; 1 = Male - `age`: Age at exam time. - `education`: 1 = Some High School; 2 = High School or GED; 3 = Some College or Vocational School; 4 = College - `currentSmoker`: 0 = nonsmoker; 1 = smoker - `cigsPerDay`: number of cigarettes smoked per day (estimated average) - `BPMeds`: 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medications - `prevalentStroke`: 0 = Stroke not prevalent in family history; 1 = Stroke prevalent in family history - `prevalentHyp`: 0 = Hypertension not prevalent in family history; 1 = Hypertension prevalent in family history - `diabetes`: 0 = No; 1 = Yes - `totChol`: total cholesterol (mg/dL) - `sysBP`: systolic blood pressure (mmHg) - `diaBP`: diastolic blood pressure (mmHg) - `BMI`: BodyMass Index calculated as: Weight (kg) / Height(meter-squared) - `heartRate` Beats/Min (Ventricular) - `glucose`: total glucose mg/dL - `TenYearCHD`: 0 = Patient doesn't have 10-year risk of future coronary heart disease; 1 = Patient has 10-year risk of future coronary heart disease ## Load and prepare data ```{r load-data} heart_data <- read_csv("https://raw.githubusercontent.com/sta210-sp20/datasets/master/framingham.csv") %>% drop_na() %>% #remove observations with missing values mutate(education = case_when( education == 1 ~ "Some HS", education == 2 ~ "HS or GED", education == 3 ~ "Some College", education == 4 ~ "College" ), TenYearCHD = as.factor(TenYearCHD), currentSmoker = as.factor(currentSmoker), ageCent = age - mean(age), totCholCent = totChol - mean(totChol), BPMeds = as.factor(BPMeds), prevalentStroke = as.factor(prevalentStroke), prevalentHyp = as.factor(prevalentHyp), male = as.factor(male), diabetes = as.factor(diabetes) ) ``` ## Fit logistic model ```{r} risk_m <- glm(TenYearCHD ~ ageCent + currentSmoker + totCholCent, data = heart_data, family = binomial) tidy(risk_m, conf.int = TRUE, exponentiate = FALSE) %>% kable(format = "markdown", digits = 3) ``` ## Predictions ### For a new patient ```{r} x0 <- data_frame(ageCent = (60 - 49.552), totCholCent = (263 - 236.848), currentSmoker = as.factor(0)) ``` ```{r} x0 ``` #### Predicted log-odds ```{r} predict(risk_m, x0) ``` #### Predicted probabilities ```{r} predict(risk_m, x0, type = "response") ``` Based on this probability, would you consider this patient as being high risk for getting coronary heart disease in the next 10 years? Why or why not? ### For all observations using `augment` ```{r} risk_m_aug <- augment(risk_m, type.predict = "response", type.residuals = "response") ``` ```{r} risk_m_aug ``` ## Confusion Matrix ```{r} threshold <- 0.1 ``` ```{r} risk_m_aug %>% mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>% group_by(TenYearCHD, risk_predict) %>% summarise(n = n()) %>% kable(format="markdown") ``` ```{r} risk_m_aug %>% mutate(risk_predict = if_else(.fitted > threshold, "1: Yes", "0: No")) %>% select(TenYearCHD, risk_predict) ``` What proportion of observations were misclassified? What is the disadvantage of relying on the confusion matrix to assess the accuracy of the model? ## ROC Curve ```{r} (roc_curve <- ggplot(risk_m_aug, aes(d = as.numeric(TenYearCHD) - 1, m = .fitted)) + geom_roc(n.cuts = 10, labelround = 3) + geom_abline(intercept = 0) + labs(x = "False Positive Rate (1 - Specificity)", y = "True Positive Rate (Sensitivity)") ) ``` ```{r} calc_auc(roc_curve)$AUC ``` A doctor plans to use the results from your model to help select patients for a new heart disease prevention program. She asks you which threshold would be best to select patients for this program. Based on the ROC curve from the previous slide, which threshold would you recommend to the doctor? Why? ## Assumptions ### Why we don't plot raw residuals ```{r} ggplot(data = risk_m_aug, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, color = "red") + labs(x = "Predicted Values", y = "Raw Residuals") ``` ### Binned residual plots ```{r} nbins <- sqrt(nrow(risk_m_aug)) risk_m_aug %>% arrange(.fitted) %>% slice(1:60) %>% summarise(mean_resid = mean(.resid), #y axis mean_pred = mean(.fitted)) #x axis ``` ```{r} arm::binnedplot(x = risk_m_aug$.fitted, y = risk_m_aug$.resid, xlab = "Predicted Probabilities", main = "Binned Residual vs. Predicted Values", col.int = FALSE) ``` ```{r} arm::binnedplot(x = risk_m_aug$ageCent, y = risk_m_aug$.resid, col.int = FALSE, xlab = "Age (Mean-Centered)", main = "Binned Residual vs. Age") ``` ```{r} arm::binnedplot(x = risk_m_aug$totCholCent, y = risk_m_aug$.resid, col.int = FALSE, xlab = "Total Cholesterol (Mean-Centered)", main = "Binned Residual vs. Total Cholesterol") ``` ```{r} risk_m_aug %>% group_by(currentSmoker) %>% summarise(mean_resid = mean(.resid)) ``` ### Check assumptions: - Linearity? - Randomness? - Independence? ### Inference for coefficients ```{r} tidy(risk_m, conf.int = TRUE, exponentiate = FALSE) %>% kable(format = "markdown") ``` How is the test statistic for `currentSmoker1` calculated? Is `totCholCent` a statistically significant predictor of the odds a person is high risk for coronary heart disease? Justify your answer using the test statistic and p-value. Justify your answer using the confidence interval. ### Drop-in-Deviance Test ```{r} model_red <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol, data = heart_data, family = binomial) model_full <- glm(TenYearCHD ~ ageCent + currentSmoker + totChol + education, data = heart_data, family = binomial) ``` ```{r} tidy(model_red) ``` ```{r} tidy(model_full) ``` ```{r echo = F} tidy(model_full, conf.int = T) %>% kable(format = "markdown", digits = 3) ``` ```{r} anova(model_red, model_full, test = "Chisq") ``` test statstic = Deviance = 2895.0 - 2887.2 P(ChiSquare > 7.7836) Chi-square distribution with 3 degrees of freedom small p-value, reject H0, at least one new term is statistically significant ### AIC ```{r} glance(model_red)$AIC glance(model_full)$AIC ``` Based on the drop-in-deviance test, which model would you select? Based on AIC, which model would you select? ### Model Selection using `step` ```{r} full_model <- glm(TenYearCHD ~ ., data = heart_data, family = "binomial") tidy(full_model) ``` ```{r} select_model <- step(full_model, direction = "backward") ``` ```{r} tidy(select_model, conf.int = TRUE, exponentiate = FALSE) %>% kable(format = "markdown", digits = 3) ``` Age currentSmoker Age*CurrentSmoker ## References Data obtained from [https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset](https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset)