Note: There will be no grace period for this assignment. With that said, this assignment will only benefit your score. If you turn this HW in and it helps your HW average, I will accept it. If this assignment does not help your grade, then I’ll just ignore it completely (it won’t count as your lowest dropped grade; I’ll just pretend HW 09 didn’t exist for you).
However, logistic regression is very important, so I highly encourage you to try this assignment and at least look at the solutions when they are available.
In R, we use the glm() function to fit generalized linear models, including logistic regression models. It is often useful to save the fitted model as its own object so that you can use it in downstream analyses. In general, the syntax for logistic regression is
model <- glm(outcome ~ predictor1 + predictor2 + ..., data = dataset, family = "binomial")
The family = "binomial" portion tells R that you are fitting a logistic regression model. It is important that the outcome variable is binary and only takes on values of 0 or 1. You may need to modify your outcome variable to satisfy this requirement.
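As an example (a hedged sketch: the "Yes"/"No" coding below is hypothetical, not taken from any course dataset), you could recode a text outcome to 0/1 before fitting:

# recode a hypothetical "Yes"/"No" outcome to 1/0 using base R
dataset$outcome <- ifelse(dataset$outcome == "Yes", 1, 0)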
For instance, let’s fit a logistic regression model using the PBC data (note, the dataset we’re using is actually modified from the original), where our response variable is whether a patient died during the follow-up time (outcome) and the three predictor variables are sex (sex), serum cholesterol levels (chol), and histologic stage of disease (stage). Suppose we want to save it as the object mod1. We would use the following code (you will have to load in the pbc dataset):
library(tidyverse)
library(broom)
pbc <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/pbc.csv")
mod1 <- glm(outcome ~ sex + chol + as.factor(stage), data = pbc, family = "binomial")
Note that the stage variable had numeric values in the raw data, but we used as.factor() to coerce it into a categorical variable in R.
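If you prefer, you could instead convert stage inside the data frame before fitting, e.g. with mutate() from the tidyverse (loaded above); a minimal sketch:

# convert stage to a factor in the data frame itself; the model formula could
# then use stage directly (the output below corresponds to the as.factor() version)
pbc <- pbc %>%
  mutate(stage = as.factor(stage))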
Once again, we can use the broom package from the tidyverse to clean up some of the regression output from standard R functions. We will also use some other functions from the tidyverse. You may need to first install these packages before proceeding.
To see a neat summary of the model, including parameter estimates, standard errors, the test statistic, and p-value for each, we can use the tidy() function:
tidy(mod1)
## # A tibble: 6 x 5
##   term               estimate std.error statistic p.value
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)       -3.45      1.13         -3.06  0.00221
## 2 sexm               1.23      0.429         2.87  0.00413
## 3 chol               0.00233   0.000736      3.17  0.00153
## 4 as.factor(stage)2  1.54      1.13          1.36  0.173
## 5 as.factor(stage)3  2.13      1.11          1.92  0.0553
## 6 as.factor(stage)4  3.19      1.12          2.86  0.00419
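Because these estimates are on the log-odds scale, you may also want to see them as odds ratios. The tidy() function can exponentiate the estimates for you; adding conf.int = TRUE also returns confidence intervals:

# exponentiated slope estimates are interpreted as odds ratios
tidy(mod1, exponentiate = TRUE, conf.int = TRUE)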
Let’s suppose we have a patient who is male, has a cholesterol level of 390 mg/dL (yikes), and has stage 4 disease. What would be the predicted log-odds that they will die during follow-up?
First, create a new tibble with these data:
patient <- tibble(sex = "m",
                  chol = 390,
                  stage = 4)
Next, we can use the augment() function to compute their predicted log-odds using the mod1 model object we created earlier:
augment(mod1, newdata = patient)
## # A tibble: 1 x 4
##   sex    chol stage .fitted
##   <chr> <dbl> <dbl>   <dbl>
## 1 m       390     4    1.89
The predicted log-odds are approximately 1.89.
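If you would rather report a predicted probability, you can apply the inverse-logit transformation to the fitted log-odds, or ask augment() for predictions on the response scale via its type.predict argument:

# inverse-logit by hand: exp(1.89) / (1 + exp(1.89)) is approximately 0.87
exp(1.89) / (1 + exp(1.89))
# or let augment() return the predicted probability directly
augment(mod1, newdata = patient, type.predict = "response")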
We will once again use hourly weather station readings from Beijing. A modified subset of 1500 of these observations is given in the beijing dataset, which has been adapted from data provided by Dr. Songxi Chen at Peking University (Proc. Roy. Soc. A 473(2205): 2017.0457).
Of interest are the following variables:
TEMP: temperature in degrees Celsius
PRES: barometric pressure in hPa
DEWP: dew point in degrees Celsius
RAIN: hourly precipitation levels in mm
wd: wind direction (eight compass points)

You may read in the data with the following code. Be sure to install the broom package if you haven’t yet!
library(tidyverse)
library(broom)
beijing <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/beijing.csv")
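Purely as an illustrative sketch (not one of the graded questions; the outcome variable any_rain below is made up for illustration), here is one way you might set up a logistic regression with these data. Note how the categorical wd variable is expanded into dummy variables automatically:

# hypothetical 0/1 outcome: did any rain fall during the hour?
beijing$any_rain <- ifelse(beijing$RAIN > 0, 1, 0)
# TEMP enters as a numeric predictor; wd is expanded into dummy variables
mod_bj <- glm(any_rain ~ TEMP + as.factor(wd), data = beijing, family = "binomial")
tidy(mod_bj)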
TEMP and the slope estimate corresponding to one of the dummy variables (your choice).

The Veterans’ Administration Lung Cancer Study is a classic dataset tracking patients with lung cancer over approximately three years. The following plot depicts the Kaplan-Meier curves for survival based on whether a participant was older than 60 years old (dashed line) or younger than 60 years old (solid line) at enrollment.