Note: There will be no grace period for this assignment. With that said, this assignment will only benefit your score. If you turn this HW in and it helps your HW average, I will accept it. If this assignment does not help your grade, then I’ll just ignore it completely (it won’t count as your lowest dropped grade; I’ll just pretend HW 09 didn’t exist for you).
However, logistic regression is very important, so I highly encourage you to try this assignment and at least look at the solutions when they are available.
In R, we use the glm() function to fit generalized linear models, including logistic regression models. It is often useful to save the fitted model as its own object so that you can use it in downstream analyses. In general, the syntax for logistic regression is
model <- glm(outcome ~ predictor1 + predictor2 + ..., data = dataset, family = "binomial")
The family = "binomial" portion tells R that you are fitting a logistic regression model. It is important that the outcome variable is binary and only takes on values of 0 or 1. You may need to modify your outcome variable to satisfy this requirement.
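As an example (a hedged sketch: the "Yes"/"No" coding below is hypothetical, not taken from any course dataset), you could recode a text outcome to 0/1 before fitting:

# recode a hypothetical "Yes"/"No" outcome to 1/0 using base R
dataset$outcome <- ifelse(dataset$outcome == "Yes", 1, 0)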
For instance, let’s fit a logistic regression model using the PBC data (note, the dataset we’re using is actually modified from the original), where our response variable is whether a patient died during the follow-up time (outcome) and the three predictor variables are sex (sex), serum cholesterol levels (chol), and histologic stage of disease (stage). Suppose we want to save it as the object mod1. We would use the following code (you will have to load in the pbc dataset):
library(tidyverse)
library(broom)
pbc <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/pbc.csv")
mod1 <- glm(outcome ~ sex + chol + as.factor(stage), data = pbc, family = "binomial")
Note that the stage variable had numeric values in the raw data, but we used as.factor() to coerce it into a categorical variable in R.
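If you prefer, you could instead convert stage inside the data frame before fitting, e.g. with mutate() from the tidyverse (loaded above); a minimal sketch:

# convert stage to a factor in the data frame itself; the model formula could
# then use stage directly (the output below corresponds to the as.factor() version)
pbc <- pbc %>%
  mutate(stage = as.factor(stage))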
Once again, we can use the broom package from the tidyverse to clean up some of the regression output from standard R functions. We will also use some other functions from the tidyverse. You may need to first install these packages before proceeding.
To see a neat summary of the model, including parameter estimates, standard errors, the test statistic, and p-value for each, we can use the tidy() function:
tidy(mod1)
## # A tibble: 6 x 5
##   term               estimate std.error statistic p.value
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)       -3.45      1.13         -3.06  0.00221
## 2 sexm               1.23      0.429         2.87  0.00413
## 3 chol               0.00233   0.000736      3.17  0.00153
## 4 as.factor(stage)2  1.54      1.13          1.36  0.173
## 5 as.factor(stage)3  2.13      1.11          1.92  0.0553
## 6 as.factor(stage)4  3.19      1.12          2.86  0.00419
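Because these estimates are on the log-odds scale, you may also want to see them as odds ratios. The tidy() function can exponentiate the estimates for you; adding conf.int = TRUE also returns confidence intervals:

# exponentiated slope estimates are interpreted as odds ratios
tidy(mod1, exponentiate = TRUE, conf.int = TRUE)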
Let’s suppose we have a patient who is male, has a cholesterol level of 390 mg/dL (yikes), and has stage 4 disease. What would be the predicted log-odds that they will die during follow-up?
First, create a new tibble with these data:
patient <- tibble(sex = "m",
                  chol = 390,
                  stage = 4)
Next, we can use the augment() function to compute their predicted log-odds using the mod1 model object we created earlier:
augment(mod1, newdata = patient)
## # A tibble: 1 x 4
##   sex    chol stage .fitted
##   <chr> <dbl> <dbl>   <dbl>
## 1 m       390     4    1.89
The predicted log-odds are approximately 1.89.
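If you would rather report a predicted probability, you can apply the inverse-logit transformation to the fitted log-odds, or ask augment() for predictions on the response scale via its type.predict argument:

# inverse-logit by hand: exp(1.89) / (1 + exp(1.89)) is approximately 0.87
exp(1.89) / (1 + exp(1.89))
# or let augment() return the predicted probability directly
augment(mod1, newdata = patient, type.predict = "response")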
We will once again use hourly weather station readings from Beijing. A modified subset of 1500 of these observations is given in the beijing dataset, which has been adapted from data provided by Dr. Songxi Chen at Peking University (Proc. Roy. Soc. A 473(2205): 2017.0457).
Of interest are the following variables:
TEMP: temperature in degrees Celsius
PRES: barometric pressure in hPa
DEWP: dew point in degrees Celsius
RAIN: hourly precipitation levels in mm
wd: wind direction (eight compass points)

You may read in the data with the following code. Be sure to install the broom package if you haven’t yet!
library(tidyverse)
library(broom)
beijing <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/beijing.csv")
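Purely as an illustrative sketch (not one of the graded questions; the outcome variable any_rain below is made up for illustration), here is one way you might set up a logistic regression with these data. Note how the categorical wd variable is expanded into dummy variables automatically:

# hypothetical 0/1 outcome: did any rain fall during the hour?
beijing$any_rain <- ifelse(beijing$RAIN > 0, 1, 0)
# TEMP enters as a numeric predictor; wd is expanded into dummy variables
mod_bj <- glm(any_rain ~ TEMP + as.factor(wd), data = beijing, family = "binomial")
tidy(mod_bj)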
TEMP and the slope estimate corresponding to one of the dummy variables (your choice).

The Veterans’ Administration Lung Cancer Study is a classic dataset tracking patients with lung cancer over approximately three years. The following plot depicts the Kaplan-Meier curves for survival based on whether a participant was older than 60 years old (dashed line) or younger than 60 years old (solid line) at enrollment.