Lab 02: Simple Linear Regression

due Tue, Jan 28 at 11:59p

The primary goal of today’s lab is to give you practice with some of the tools you will need to conduct regression analysis using R. An additional goal for today is for you to be introduced to your teams and practice collaborating using GitHub and RStudio.

Getting Started

Each of your assignments will begin with the following steps.

Clone Assignment Repo

Clone the repo & start new RStudio project

Configure git

If you are unable to push to GitHub, it may be because you need to configure git. Follow the steps below to configure git.

To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your name and email address.

The email address is the one tied to your GitHub account.

library(usethis)
use_git_config(user.name = "GitHub username", user.email = "your email")

For example, mine would be

library(usethis)
use_git_config(user.name = "matackett", user.email = "maria.tackett@duke.edu")

If you get the error message

Error in library(usethis) : there is no package called ‘usethis’

then you need to install the usethis package. Run the following code in the console to install the package. Then, rerun the use_git_config function with your GitHub username and email address associated with your GitHub account.

install.packages("usethis")

Once you run the configuration code, your values for user.name and user.email will display in the console. If your user.name and user.email are correct, you’re good to go! Otherwise, run the code again with the necessary changes.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)
library(modelr)
library(openintro) #package containing dataset
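
If any of the library() calls above fails with a "there is no package called" error, the package needs to be installed first. The sketch below is one way to check all of them at once. Note that missing_packages() is a hypothetical helper written for this illustration and is not part of any package.

```r
# Hypothetical helper (not from any package): return the names of the
# packages in `pkgs` that are not installed in the current R library.
# requireNamespace() returns TRUE or FALSE instead of raising an error.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

lab_packages <- c("tidyverse", "knitr", "broom", "modelr", "openintro")
to_install <- missing_packages(lab_packages)

if (length(to_install) > 0) {
  message("Install with install.packages(): ",
          paste(to_install, collapse = ", "))
} else {
  message("All lab packages are installed.")
}
```

Anything the message lists can then be installed from the console with install.packages().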

Warm up

Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.

Committing and pushing changes:

Pulling changes:

Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.

If you need a more detailed review of committing and pushing changes, see Getting Started from Lab 01.

Data: Gift aid at Elmhurst College

In today’s lab, we will analyze the elmhurst dataset in the openintro package. This dataset contains information about 50 randomly selected students from the 2011 freshmen class at Elmhurst College. The data were originally sampled from a table of all 2011 freshmen at the college, which was included in the article “What Students Really Pay to go to College” in The Chronicle of Higher Education.

You can load the data by loading the openintro package and then running the following command:

data(elmhurst)

The elmhurst dataset contains the following variables:

family_income    Family income of the student
gift_aid         Gift aid, in $ thousands
price_paid       Price paid by the student (= tuition - gift_aid)
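
Before starting the exercises, it can help to take a quick look at the data. The following is a minimal sketch, assuming the openintro and dplyr packages are installed. It only inspects the dataset and does not answer any of the exercises.

```r
library(openintro)  # provides the elmhurst data
library(dplyr)      # provides glimpse()

data(elmhurst)

glimpse(elmhurst)   # variable names, types, and the first few values
nrow(elmhurst)      # number of students in the sample
```

Checking the printed values against the variable descriptions above is a simple way to confirm the units each variable is recorded in.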

Exercises

To help get used to the group workflow, there are markers throughout the lab to indicate when each team member should be typing. The other group members should still contribute to the discussion when they are not typing; however, they should not change anything in RStudio.

Exploratory Data Analysis

Team Member 1: Your turn to type!

  1. Plot a histogram to examine the distribution of gift_aid. What is the approximate shape of the distribution? Also note if there are any outliers in the dataset.

  2. To better understand the distribution of gift_aid, we would like to calculate measures of center and spread of the distribution. Use the summarise function to calculate the appropriate measures of center (mean or median) and spread (standard deviation or IQR) based on the shape of the distribution from Exercise 1. Show the code and output, and state the measures of center and spread in your narrative. Be sure to report your conclusions for this exercise and the remainder of the lab in dollars.

  3. Plot the distribution of family_income and calculate the appropriate summary statistics. Describe the distribution of family_income (shape, center, spread, and outliers) using the plot and appropriate summary statistics.

  4. Create a scatterplot to display the relationship between gift_aid (response variable) and family_income (predictor variable). Use the scatterplot to describe the relationship between the two variables. Be sure the scatterplot includes informative axis labels and title.
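
As a reminder of the syntax, here is a sketch of the same kinds of plots and summaries on a different dataset. It uses the built-in mtcars data rather than elmhurst, so the variable names (mpg, wt), the number of bins, and the labels are illustrative choices only and should be adapted to the lab data.

```r
library(ggplot2)
library(dplyr)

# Histogram of one numeric variable (mpg from the built-in mtcars data)
mpg_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Fuel economy (miles per gallon)",
       y = "Number of cars",
       title = "Distribution of fuel economy")

# Both pairs of summaries are computed here; which pair is appropriate
# to report depends on the shape seen in the histogram.
mpg_summary <- mtcars %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg = sd(mpg),
            median_mpg = median(mpg),
            iqr_mpg = IQR(mpg))

# Scatterplot: predictor on the x-axis, response on the y-axis
mpg_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)",
       y = "Fuel economy (miles per gallon)",
       title = "Fuel economy versus weight")

mpg_hist
mpg_summary
mpg_scatter
```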

Simple Linear Regression

Team Member 1: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 2: Your turn to type!

  5. Use the lm function to fit a simple linear regression model using family_income to explain variation in gift_aid. Complete the code below to assign your model a name, and use the tidy and kable functions to neatly display the model output. Replace X and Y with the appropriate variable names.

_____ <- lm(Y ~ X, data = _____)
tidy(_____) %>% # output model
  kable(digits = 3) # format model output

  6. Interpret the slope in the context of the problem.

  7. When we fit a linear regression model, we make assumptions about the underlying relationship between the response and predictor variables. In practice, we can check that the assumptions hold by analyzing the residuals. Over the next few questions, we will examine plots of the residuals to determine if the assumptions are met.

    See Checking Model Assumptions for more details. We will also discuss these in class.

    Let’s begin by calculating the residuals and adding them to the dataset. Fill in the model name in the code below to add residuals to the original dataset using the residuals() and mutate() functions.

_____ <- _____ %>%
  mutate(resid = residuals(_____))

  8. One of the assumptions for regression is that there is a linear relationship between the predictor and response variables. To check this assumption, we will examine a scatterplot of the residuals versus the predictor variable.

    Create a scatterplot with the predictor variable on the x axis and residuals on the y axis. Be sure to include an informative title and properly label the axes.
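
The workflow in the last few exercises (fit the model, display it, add the residuals, plot them against the predictor) can be sketched as follows. This is an illustration on the built-in mtcars data, not the lab data; the names mpg_model, cars_resid, and resid_plot are placeholders chosen for this example.

```r
library(dplyr)
library(ggplot2)
library(broom)
library(knitr)

# Fit a simple linear regression: response ~ predictor
mpg_model <- lm(mpg ~ wt, data = mtcars)

# Display the coefficient table neatly
tidy(mpg_model) %>%
  kable(digits = 3)

# Add the residuals to the data as a new column
cars_resid <- mtcars %>%
  mutate(resid = residuals(mpg_model))

# Residuals versus the predictor, with a reference line at zero
resid_plot <- ggplot(cars_resid, aes(x = wt, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Weight (1000 lbs)",
       y = "Residual (miles per gallon)",
       title = "Residuals versus weight")

resid_plot
```

The dashed line at zero is an optional addition that makes it easier to judge whether the points scatter evenly above and below it.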

Team Member 2: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 3: Your turn to type!

  9. Examine the plot from the previous question to assess the linearity condition.

    • Ideally, there would be no discernible shape in the plot. This is an indication that the linear model adequately describes the relationship between the response and predictor, and all that is left is the random error that can’t be accounted for in the model, i.e. other things that affect gift aid besides family income.
    • If there is an obvious shape in the plot (e.g. a parabola), this means that the linear model does not adequately describe the relationship between the response and predictor variables.

    Based on this, is the linearity condition satisfied? Briefly explain your reasoning.

  10. Recall that when we fit a regression model, we assume that for any given value of \(x\), the \(y\) values follow a Normal distribution with mean \(\beta_0 + \beta_1 x\) and variance \(\sigma^2\). We will look at two sets of plots to check that this assumption holds.

    We begin by checking the constant variance assumption, i.e. that the variance of \(y\) is approximately equal for each value of \(x\). To check this, we will use the scatterplot of the residuals versus the predictor variable \(x\). Ideally, as we move from left to right, the vertical spread of the residuals will be approximately equal, i.e. there is no “fan” pattern.

    Using the scatterplot from Exercise 8, is the constant variance assumption satisfied? Briefly explain your reasoning. Note: You don’t need to know the value of \(\sigma^2\) to answer this question.

  11. Next, we will assess the Normality assumption, i.e. that the distribution of the \(y\) values is Normal at every value of \(x\). In practice, it is impossible to check the distribution of \(y\) at every possible value of \(x\), so we can check whether the assumption is satisfied by looking at the overall distribution of the residuals. The assumption is satisfied if the distribution of residuals is approximately Normal, i.e. unimodal and symmetric.

    Make a histogram of the residuals. Based on the histogram, is the Normality assumption satisfied? Briefly explain your reasoning.
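
A histogram of residuals can be made in the same way as any other histogram. The sketch below again uses the built-in mtcars data for illustration; the object names and the number of bins are arbitrary choices for this example.

```r
library(ggplot2)

# Refit the illustrative model so this example stands on its own
mpg_model <- lm(mpg ~ wt, data = mtcars)

# Put the residuals in a data frame so ggplot can use them
resid_df <- data.frame(resid = residuals(mpg_model))

resid_hist <- ggplot(resid_df, aes(x = resid)) +
  geom_histogram(bins = 10) +
  labs(x = "Residual (miles per gallon)",
       y = "Number of cars",
       title = "Distribution of residuals")

resid_hist
```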

Team Member 3: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 4: Your turn to type!

  12. The final assumption is that the observations are independent, i.e. one observation does not affect another. We can typically make an assessment about this assumption using a description of the data. Do you think the independence assumption is satisfied? Briefly explain your reasoning.

Using the Model

  13. Calculate \(R^2\) for this model and interpret it in the context of the data.

  14. Suppose a high school senior is considering Elmhurst College, and she would like to use your regression model to estimate how much gift aid she can expect to receive. Her family income is $90,000. Based on your model, about how much gift aid should she expect to receive? Show the code or calculations you use to get the prediction.

    Next week, we will discuss how to account for the uncertainty in the prediction using a prediction interval.

  15. Another high school senior is considering Elmhurst College, and her family income is about $310,000. Do you think it would be wise to use your model to calculate the predicted gift aid for this student? Briefly explain your reasoning.
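
The three tasks in this section (extracting \(R^2\), predicting for a new observation, and checking whether a new value lies within the observed range of the predictor) can be sketched as follows. This is an illustration on the built-in mtcars data; new_car and the other object names are hypothetical. Note that the new value must be expressed in the same units as the predictor in the data.

```r
library(broom)

# Refit the illustrative model so this example stands on its own
mpg_model <- lm(mpg ~ wt, data = mtcars)

# R-squared from the one-row model summary returned by glance()
r_squared <- glance(mpg_model)$r.squared

# Prediction for a new observation; the column name must match the
# predictor name used in the model formula
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(mpg_model, newdata = new_car)

# Compare the new value with the range of the predictor in the data;
# predictions outside this range are extrapolations
wt_range <- range(mtcars$wt)
in_range <- new_car$wt >= wt_range[1] & new_car$wt <= wt_range[2]

r_squared
predicted_mpg
in_range
```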

Team Member 4: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.



You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.

Submitting the Assignment

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered “on time”.

To submit your assignment:

Notes:

Grading

Exploratory Data Analysis            15
Regression Model & Assumptions       20
Using the Model                       8
Lab attendance & participation        3
Narrative in full sentences           2
Commit messages from every member     2
Total                                50