In R, we use the lm() function to fit linear models. It is often useful to save the resulting “model object” so that it can be used for downstream analyses. In general, the syntax is
model <- lm(outcome ~ predictor1 + predictor2 + ..., data = dataset)
For instance, let’s fit a linear model using the CDC data, where the response variable is obesity percentage and the three predictor variables are adequate exercise percentage, smoking percentage, and HDI category. To save the fitted model as the object mod1, we would use the following code:
cdc <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/cdc.csv")
mod1 <- lm(obesity ~ exercise + smoking + hdi, data = cdc)
We will be using the broom package from the tidyverse to clean up (ha) some of the regression output from standard R functions. We will also use some other functions from the tidyverse. You may need to first install these packages before proceeding:
library(tidyverse)
library(broom)
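If these packages have not been installed yet, a one-time installation (run once per machine, not in every analysis script) would look like:
install.packages(c("tidyverse", "broom"))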
First, let’s take a look at the model output with the tidy function. This function is applied to a linear model object such as the mod1 object we saved earlier:
tidy(mod1)
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   37.6      3.57      10.5   9.66e-14
## 2 exercise      -0.281    0.0539    -5.22  4.46e- 6
## 3 smoking        0.431    0.0922     4.68  2.68e- 5
## 4 hdiMiddle     -0.724    0.794     -0.912 3.67e- 1
## 5 hdiTop ten    -2.78     1.00      -2.77  8.07e- 3
We see that the output provides columns containing the coefficient estimates, their standard errors, the t-statistic for each coefficient, and the corresponding p-values.
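Because tidy() returns an ordinary tibble, we can manipulate it like any other dataframe. As a small sketch, we could request 95% confidence intervals for the coefficients (conf.int is a standard argument of tidy() for lm objects), or filter the table down to a single term:
tidy(mod1, conf.int = TRUE, conf.level = 0.95)  # adds conf.low and conf.high columns
tidy(mod1) %>% filter(term == "exercise")       # just the exercise row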
Note that R automatically recognized hdi as a categorical variable since its values are character strings, and created the appropriate dummy variables, treating the first category in alphabetical order as the baseline. If you have a categorical variable that takes on numerical values, you can wrap it in the as.factor function inside the model call; R will then treat it as categorical and create dummy variables from it. For instance:
model <- lm(outcome ~ predictor1 + as.factor(predictor2) + ..., data = dataset)
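Relatedly, if you want a baseline other than the alphabetical default, you can reorder the factor levels yourself. A hedged sketch using the cdc data (assuming “Top ten” is one of the hdi categories, as the output above suggests):
# make "Top ten" the baseline level, then refit the model
cdc2 <- cdc %>% mutate(hdi = relevel(factor(hdi), ref = "Top ten"))
mod1_alt <- lm(obesity ~ exercise + smoking + hdi, data = cdc2)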
We can also obtain model-level statistics such as \(R^2\), adjusted \(R^2\), the F-statistic (and its corresponding p-value), and a few others by applying the glance function to our model object. Note that the df given in the table is the numerator degrees of freedom for the overall F test.
glance(mod1)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.805         0.787  1.67      46.3 2.11e-15     4  -94.1  200.  212.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
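Since glance() also returns a tibble, individual statistics are easy to extract. For instance:
glance(mod1)$adj.r.squared            # base R extraction
glance(mod1) %>% pull(adj.r.squared)  # equivalent, in tidyverse style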
The augment() function from the broom package gives us some additional information relating to our linear model. Of primary interest are the fitted values and the residuals for each observation. The fitted values are the predicted response values given the values for our explanatory variables, and the residuals are the difference between the actual response data points and their fitted (predicted) values from the model.
Let’s apply the augment function to the mod1 object we created earlier, and save it as a new dataframe, mod1_res (which contains the residuals):
mod1_res <- augment(mod1)
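To see what augment() produced, we can peek at the relevant columns. Since augment() carries along the model variables, the response obesity sits alongside .fitted and .resid, and each residual is simply obesity minus .fitted:
mod1_res %>% select(obesity, .fitted, .resid) %>% head()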
The .fitted and .resid variables (note the leading period) from this dataframe are what we will use to evaluate some model assumptions. For instance, we could evaluate the normality of the residuals by plotting a histogram, or evaluate linearity/homogeneity of variances with a scatterplot of the fitted values vs. the residuals.
For instance, we can create a ggplot histogram of .resid to visualize the distribution of the residuals:
ggplot(data = mod1_res, mapping = aes(x = .resid)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Residuals", y = "Count",
       title = "Residuals are slightly right-skewed")
We can also create a scatterplot of the fitted values vs. the residuals (the residual plot):
ggplot(data = mod1_res, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residual plot shows no clear pattern")
Interaction terms are created by multiplying the two variables of interest together with the * operator. As an example:
mod_int <- lm(outcome ~ var1 * var2, data = dataset)
Note the var1 * var2 term in the model above: in R’s formula syntax, it is shorthand for var1 + var2 + var1:var2, i.e., both main effects plus their interaction.
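As a concrete (purely illustrative) sketch with the cdc data, we might allow the effect of exercise to differ across hdi categories and inspect the coefficients with tidy():
mod_int2 <- lm(obesity ~ exercise * hdi, data = cdc)
tidy(mod_int2)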
Atmospheric PM2.5 is fine particulate matter that has a diameter of less than 2.5 micrometers, and is one of the major air pollutants in the atmosphere. Such particles penetrate deep into the lungs due to their small size, and are associated with cardiovascular disease and other adverse health outcomes. Thus, higher PM2.5 levels imply lower air quality.
Scientists are interested in evaluating whether atmospheric PM2.5 levels can be predicted by weather patterns. Hourly PM2.5 readings from 2013-2017 were captured at a weather station at Nongzhanguan in Chaoyang District, Beijing. A modified subset of 1500 of these observations is given in the beijing dataset, which has been adapted from data provided by Dr. Songxi Chen at Peking University (Proc. Roy. Soc. A 473(2205): 2017.0457).
Of interest are the following variables:
PM2.5: PM2.5 levels in micrograms per cubic meter
TEMP: temperature in degrees Celsius
PRES: barometric pressure in hPa
DEWP: dew point in degrees Celsius
RAIN: hourly precipitation levels in mm
wd: wind direction (eight compass points)
You may read in the data with the following code. Be sure to install the broom package if you haven’t yet!
library(tidyverse)
library(broom)
beijing <- read.csv("https://www2.stat.duke.edu/courses/Fall20/sta102/hw/data/beijing.csv")
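Before modeling, it is worth a quick structural check of the data, for instance with glimpse() from the tidyverse:
glimpse(beijing)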
TEMP, and the slope corresponding to wdSW.
Note: the t statistic for a single predictor in a multiple regression model has \(n - k\) degrees of freedom, where n is the number of observations and k is the number of parameters being estimated (including the intercept).
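To illustrate the note, here is a hedged sketch of computing a two-sided p-value for a single coefficient by hand from its t statistic (k and t_stat below are placeholders, not results from the beijing data):
n <- 1500      # number of observations in the beijing dataset
k <- 10        # number of estimated parameters, including the intercept (placeholder)
t_stat <- 2.1  # hypothetical t statistic for a single predictor
2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)  # two-sided p-value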