Fitting linear models

In R, we use the lm() function to fit linear models. It is often useful to save the fitted model as an object so that it can be used in downstream analyses. In general, the syntax is

model <- lm(outcome ~ predictor1 + predictor2 + ..., data = dataset)

For instance, let’s fit a linear model using the CDC data, where our response variable is obesity percentage and the three predictor variables are adequate exercise percentage, smoking percentage, and HDI category, and suppose we want to save it as the object mod1. We would use the following code:

cdc <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/cdc.csv")
mod1 <- lm(obesity ~ exercise + smoking + hdi, data = cdc)

We will be using the broom package from the tidyverse to clean up (ha) some of the regression output from standard R functions. We will also use some other functions from the tidyverse. You may need to first install these packages before proceeding:

# install.packages(c("tidyverse", "broom"))   # run once if not yet installed
library(tidyverse)
library(broom)

Model estimates

First, let’s take a look at the model output with the tidy function. This function is applied to a linear model object such as the mod1 object we saved earlier:

tidy(mod1)
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   37.6      3.57      10.5   9.66e-14
## 2 exercise      -0.281    0.0539    -5.22  4.46e- 6
## 3 smoking        0.431    0.0922     4.68  2.68e- 5
## 4 hdiMiddle     -0.724    0.794     -0.912 3.67e- 1
## 5 hdiTop ten    -2.78     1.00      -2.77  8.07e- 3

We see that the output provides columns containing our coefficient estimates, their standard errors, the t-statistic corresponding to each, and the p-value.
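
If you would also like confidence intervals for the coefficients, tidy() can compute them; the following (at the default 95% level) adds conf.low and conf.high columns to the output:

tidy(mod1, conf.int = TRUE, conf.level = 0.95)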

Note that R automatically recognized hdi as a categorical variable since its values are character strings, and created the appropriate dummy variables (treating the first category in alphabetical order as the baseline). If you have a categorical variable that takes on numerical values, wrap it in the as.factor function in the model call so that R treats it as categorical and creates the dummy variables for you. For instance:

model <- lm(outcome ~ predictor1 + as.factor(predictor2) + ..., data = dataset)
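
If you would rather use a different baseline category, one option (a sketch, using the "Top ten" hdi category from the output above) is to relevel the factor before fitting:

# make "Top ten" the baseline category instead of the alphabetical default
cdc$hdi_relevel <- relevel(as.factor(cdc$hdi), ref = "Top ten")
mod1_relevel <- lm(obesity ~ exercise + smoking + hdi_relevel, data = cdc)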

We can also get model-level values such as the \(R^2\), adjusted \(R^2\), the F-statistic (and corresponding p-value), and some other statistics by applying the glance function to our model object. Note that the df given in the table is the numerator degrees of freedom for the overall F test.

glance(mod1)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.805         0.787  1.67      46.3 2.11e-15     4  -94.1  200.  212.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
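
Since glance() returns a one-row tibble, you can also pull out a single statistic directly, for example:

glance(mod1)$r.squared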

Evaluating residuals

The augment() function from the broom package gives us some additional information relating to our linear model. Of primary interest are the fitted values and the residuals for each observation. The fitted values are the predicted response values given the values for our explanatory variables, and the residuals are the difference between the actual response data points and their fitted (predicted) values from the model.

Let’s apply the augment function to the mod1 object we created earlier, and save it as a new dataframe, mod1_res (which contains the residuals):

mod1_res <- augment(mod1)
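
You can peek at the first few rows to see the new columns that augment() appends to the data:

head(mod1_res)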

The .fitted and .resid variables (note the leading period) in this dataframe are what we will use to evaluate some model assumptions. For instance, we can evaluate the normality of the residuals by plotting a histogram, or evaluate linearity and homogeneity of variances by plotting a scatterplot of the fitted values vs. the residuals.

For instance, we can create a ggplot histogram of .resid to examine the distribution of the residuals:

ggplot(data = mod1_res, mapping = aes(x = .resid)) + 
  geom_histogram(binwidth = 1) + 
  labs(x = "Residuals", y = "Count",
       title = "Residuals are slightly right-skewed")

Or a scatterplot of the fitted values vs. the residuals (the residual plot):

ggplot(data = mod1_res, mapping = aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  labs(x = "Fitted values", y = "Residuals",
       title = "Residual plot shows no clear pattern")

Interactions

Interaction terms are specified by "multiplying" the two variables of interest together in the model formula. As an example:

mod_int <- lm(outcome ~ var1 + var2 + var1 * var2, data = dataset)

Note the var1 * var2 term in the model above. In an R formula, var1 * var2 is shorthand for var1 + var2 + var1:var2 (both main effects plus their interaction), so the model above could equivalently be written as lm(outcome ~ var1 * var2, data = dataset).
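
As a concrete sketch using the CDC data from earlier, we could allow the slope of exercise to differ across HDI categories; tidy() will then show interaction terms such as exercise:hdiMiddle:

# interaction between a numerical and a categorical predictor
mod_cdc_int <- lm(obesity ~ exercise * hdi, data = cdc)
tidy(mod_cdc_int)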

Exercises

Atmospheric PM2.5 is fine particulate matter that has a diameter of less than 2.5 micrometers, and is one of the major air pollutants in the atmosphere. Such particles penetrate deep into the lungs due to their small size, and are associated with cardiovascular disease and other adverse health outcomes. Thus, higher PM2.5 levels imply lower air quality.

Scientists are interested in evaluating whether atmospheric PM2.5 levels can be predicted by weather patterns. Hourly PM2.5 readings from 2013 to 2017 were captured at a weather station at Nongzhanguan in Chaoyang District, Beijing. A modified subset of 1500 of these observations is given in the beijing dataset, which has been adapted from data provided by Dr. Songxi Chen at Peking University (Proc. Roy. Soc. A 473(2205): 20170457).

Of interest are the following variables: the PM2.5 level (the response), temperature (TEMP, in degrees C), barometric pressure (in hPa), dew point (in degrees C), precipitation, and wind direction (wd; its levels include SW for southwest).

You may read in the data with the following code. Be sure to install the broom package if you haven’t yet!

library(tidyverse)
library(broom)
beijing <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/beijing.csv")
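
A quick way to check the variable names and types after reading in the data is glimpse(), which is loaded with the tidyverse:

glimpse(beijing)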

  1. (10 points) Fit a linear model that predicts PM2.5 levels based on the temperature, pressure, dew point, precipitation, and wind direction. Write out the fitted model (in words and/or symbols). Do not include interactions. You may have to write your model on multiple lines for it to fit on the page – please check before submitting!
  2. (20 points) Comprehensively assess whether all assumptions for your linear model in Exercise 1 are satisfied.
  3. (6 points) What PM2.5 level would you predict for a day with a temperature of 11 degrees C, a pressure of 1040 hPa, a dew point of 5 degrees C, no rain, and wind coming from the south?
  4. (15 points) Provide and interpret the intercept estimate, the slope corresponding to TEMP, and the slope corresponding to wdSW.
  5. (15 points) Regardless of whether the assumptions in Exercise 2 are satisfied, comprehensively evaluate the following question: is there sufficient evidence of a relationship between barometric pressure and PM2.5 levels, while adjusting for temperature, dew point, rain, and wind direction? Explain.

Note: the t statistic for a single predictor in a multiple regression model has \(n − k\) degrees of freedom, where n is the number of observations, and k is the number of parameters being estimated (including the intercept).
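
As a quick illustration of this note (with hypothetical numbers, not the answer to any exercise), a two-sided p-value can be computed from a t statistic with pt():

n <- 1500       # number of observations in beijing
k <- 8          # hypothetical number of estimated parameters, including the intercept
t_stat <- 2.1   # hypothetical t statistic for the predictor of interest
2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)   # two-sided p-value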

  6. (10 points) Fit a new linear model that aims to predict PM2.5 levels based on the temperature, barometric pressure, and their interaction. Write out the fitted model (in words and/or symbols).
  7. (10 points) What would be the expected difference in PM2.5 levels given a one degree increase in temperature?
  8. (8 points) Is there sufficient evidence that the relationship between temperature and PM2.5 depends on the barometric pressure? Briefly explain.
  9. (6 points) Provide and interpret your \(R^2\) values from the models in Exercises 1 and 6.