The primary goal of today’s lab is to give you practice with some of the tools you will need to conduct regression analysis using R. An additional goal for today is for you to be introduced to your teams and practice collaborating using GitHub and RStudio.
Each of your assignments will begin with the following steps.
Click on the green Clone or download button and select Use HTTPS (this might already be selected by default, in which case you’ll see the text Clone with HTTPS). Click on the clipboard icon to copy the repo URL.
Go to https://vm-manage.oit.duke.edu/containers and log in with your Duke NetID and password.
Click to log into the Docker container STA 210 - Regression Analysis. You should now see the RStudio environment.
Go to File ➡️ New Project ➡️ Version Control ➡️ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. You can leave Project Directory Name empty. It will default to the name of the GitHub repo.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
If you are unable to push to GitHub, it may be because you need to configure git. To do so, you will use the use_git_config() function from the usethis package.
Type the following lines of code in the console in RStudio filling in your name and email address.
The email address is the one tied to your GitHub account.
For example, mine would be
If you get the error message
then you need to install the usethis package. Run the following code in the console to install the package. Then, rerun the use_git_config function with your GitHub username and the email address associated with your GitHub account.
Once you run the configuration code, your values for user.name and user.email will display in the console. If your user.name and user.email are correct, you’re good to go! Otherwise, run the code again with the necessary changes.
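The configuration steps above can be sketched as follows. This is a minimal sketch that assumes the usethis package; the name and email address shown are placeholders, not real account details.

```r
# Install usethis first if library(usethis) reports that the package is missing.
# install.packages("usethis")

library(usethis)

# Placeholder values (hypothetical): replace them with your own name and
# the email address tied to your GitHub account.
use_git_config(
  user.name = "Your Name",
  user.email = "your.email@example.com"
)
```

By default, use_git_config() writes to your user-level git configuration, so this should only need to be run once per machine or container.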
Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.
Before we introduce the data, let’s warm up with a simple exercise.
Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
If you need a more detailed review of committing and pushing changes, see Getting Started from Lab 01.
In today’s lab, we will analyze the elmhurst dataset in the openintro package. This dataset contains information about 50 randomly selected students from the 2011 freshman class at Elmhurst College. The data were originally sampled from a table on all 2011 freshmen at the college that was included in the article “What Students Really Pay to go to College” in The Chronicle of Higher Education.
You can load the data by loading the openintro package and then running the following command:
The elmhurst dataset contains the following variables:
family_income | Family income of the student |
gift_aid | Gift aid (in $ thousands) |
price_paid | Price paid by the student (= tuition - gift_aid) |
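As a rough sketch of the loading step described above, the following assumes that the openintro and dplyr packages are installed. The call to glimpse() is an addition for illustration, not part of the original instructions.

```r
library(openintro)  # provides the elmhurst data frame
library(dplyr)      # provides glimpse()

# Make the dataset available in the current session.
data(elmhurst)

# Check the number of rows and the variable types before starting the exercises.
glimpse(elmhurst)
```

Looking at the data first also lets you confirm the units of each variable before you report results in dollars.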
To help get used to the group workflow, there are markers throughout the lab to indicate when each team member should be typing. The other group members should still contribute to the discussion when they are not typing; however, they should not change anything in RStudio.
Team Member 1: Your turn to type!
Plot a histogram to examine the distribution of gift_aid. What is the approximate shape of the distribution? Also note if there are any outliers in the dataset.
To better understand the distribution of gift_aid, we would like to calculate measures of center and spread of the distribution. Use the summarise function to calculate the appropriate measures of center (mean or median) and spread (standard deviation or IQR) based on the shape of the distribution from Exercise 1. Show the code and output, and state the measures of center and spread in your narrative. Be sure to report your conclusions for this exercise and the remainder of the lab in dollars.
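The pattern for a histogram followed by summary statistics might look like the sketch below. It uses mtcars, a dataset that ships with R, purely as a stand-in; the variable hp and the object names are illustrative assumptions, so swap in the lab's data and variable when you write your own code.

```r
library(ggplot2)
library(dplyr)

# mtcars is a stand-in dataset that ships with R; it is NOT the lab data.
# Histogram of a single numeric variable.
hp_hist <- ggplot(data = mtcars, aes(x = hp)) +
  geom_histogram(bins = 10) +
  labs(
    x = "Gross horsepower",
    y = "Count",
    title = "Distribution of horsepower"
  )
hp_hist

# Compute both pairs of statistics; which pair to report depends on the
# shape of the histogram (mean and SD if roughly symmetric, median and
# IQR if skewed or if there are outliers).
hp_summary <- mtcars %>%
  summarise(
    mean_hp = mean(hp),
    sd_hp = sd(hp),
    median_hp = median(hp),
    iqr_hp = IQR(hp)
  )
hp_summary
```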
Plot the distribution of family_income and calculate the appropriate summary statistics. Describe the distribution of family_income (shape, center, spread, and outliers) using the plot and appropriate summary statistics.
Create a scatterplot to display the relationship between gift_aid (response variable) and family_income (predictor variable). Use the scatterplot to describe the relationship between the two variables. Be sure the scatterplot includes informative axis labels and a title.
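A labeled scatterplot can be sketched as follows, again using mtcars as a stand-in rather than the lab data. The variables wt and mpg and the label text are assumptions chosen for illustration.

```r
library(ggplot2)

# Stand-in data (mtcars ships with R); swap in the lab's data and variables.
# The predictor goes on the x-axis and the response on the y-axis.
mpg_scatter <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Fuel economy (miles per gallon)",
    title = "Fuel economy versus weight"
  )
mpg_scatter
```

Including units in the axis labels makes the plot easier to interpret when you describe the relationship in your narrative.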
Team Member 1: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 2: Your turn to type!
Use the lm function to fit a simple linear regression model using family_income to explain variation in gift_aid. Complete the code below to assign your model a name, and use the tidy and kable functions to neatly display the model output. Replace X and Y with the appropriate variable names.
_____ <- lm(Y ~ X, data = _____)
tidy(_____) %>% # output model
  kable(digits = 3) # format model output
Interpret the slope in the context of the problem.
When we fit a linear regression model, we make assumptions about the underlying relationship between the response and predictor variables. In practice, we can check that the assumptions hold by analyzing the residuals. Over the next few questions, we will examine plots of the residuals to determine if the assumptions are met.
See Checking Model Assumptions for more details. We will also discuss these in class.
Let’s begin by calculating the residuals and adding them to the dataset. Fill in the model name in the code below to add residuals to the original dataset using the resid() and mutate() functions.
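The original code for this step is not reproduced here. A minimal sketch of the resid() and mutate() pattern, using a stand-in model fit to mtcars (the model and object names are hypothetical, not the lab's), might look like this:

```r
library(dplyr)

# Stand-in model on mtcars (ships with R); this is NOT the lab's model.
mpg_model <- lm(mpg ~ wt, data = mtcars)

# resid() returns one residual per row used in the fit, in row order,
# so the values can be attached as a new column with mutate().
mtcars_resid <- mtcars %>%
  mutate(resid = resid(mpg_model))

# Each residual is the observed response minus the fitted value.
head(mtcars_resid$resid)
```

This works because the model was fit to every row of the data frame. If rows had been dropped for missing values, the residual vector would be shorter than the data and the mutate() call would fail.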
One of the assumptions for regression is that there is a linear relationship between the predictor and response variables. To check this assumption, we will examine a scatterplot of the residuals versus the predictor variable.
Create a scatterplot with the predictor variable on the x axis and residuals on the y axis. Be sure to include an informative title and properly label the axes.
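One way such a residual plot could be built is sketched below, once more with mtcars as a stand-in. The horizontal reference line at zero is an optional addition, not something the exercise requires.

```r
library(ggplot2)

# Stand-in model on mtcars (ships with R); this is NOT the lab's model.
mpg_model <- lm(mpg ~ wt, data = mtcars)

# Pair each predictor value with its residual.
resid_data <- data.frame(
  wt = mtcars$wt,
  resid = resid(mpg_model)
)

resid_plot <- ggplot(data = resid_data, aes(x = wt, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +  # reference line at zero
  labs(
    x = "Weight (1000 lbs)",
    y = "Residual (miles per gallon)",
    title = "Residuals versus weight"
  )
resid_plot
```

Points scattered evenly above and below the zero line with no curved pattern are consistent with a linear relationship.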
Team Member 2: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 3: Your turn to type!
Examine the plot from the previous question to assess the linearity condition.
Based on this, is the linearity condition satisfied? Briefly explain your reasoning.
Recall that when we fit a regression model, we assume for any given value of \(x\), the \(y\) values follow the Normal distribution with mean \(\beta_0 + \beta_1 x\) and variance \(\sigma^2\). We will look at two sets of plots to check that this assumption holds.
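Written out for a single observation \(i\), the assumption above and the residual used to check it can be stated as follows. The hats denote values estimated from the fitted model; this notation is introduced here for illustration.

\[
y_i \mid x_i \sim N\left(\beta_0 + \beta_1 x_i,\ \sigma^2\right), \qquad e_i = y_i - \hat{y}_i = y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_i\right)
\]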
We begin by checking the constant variance assumption, i.e., that the variance of \(y\) is approximately equal for each value of \(x\). To check this, we will use the scatterplot of the residuals versus the predictor variable \(x\). Ideally, as we move from left to right, the vertical spread of the points will be approximately equal, i.e., there is no “fan” pattern.
Using the scatterplot from Exercise 8, is the constant variance assumption satisfied? Briefly explain your reasoning. Note: You don’t need to know the value of \(\sigma^2\) to answer this question.
Next, we will assess the Normality assumption, i.e., that the distribution of the \(y\) values is Normal at every value of \(x\). In practice, it is impossible to check the distribution of \(y\) at every possible value of \(x\), so we instead check whether the assumption is satisfied by looking at the overall distribution of the residuals. The assumption is satisfied if the distribution of residuals is approximately Normal, i.e., unimodal and symmetric.
Make a histogram of the residuals. Based on the histogram, is the Normality assumption satisfied? Briefly explain your reasoning.
Team Member 3: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 4: Your turn to type!
Next week, we will discuss how to account for the uncertainty in the prediction using a prediction interval.
Suppose a high school senior is considering Elmhurst College, and she would like to use your regression model to estimate how much gift aid she can expect to receive. Her family income is $90,000. Based on your model, about how much gift aid should she expect to receive? Show the code or calculations you use to get the prediction.
Another high school senior is considering Elmhurst College, and her family income is about $310,000. Do you think it would be wise to use your model to calculate the predicted gift aid for this student? Briefly explain your reasoning.
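The mechanics of getting a prediction from a fitted model can be sketched as below, using a stand-in model on mtcars rather than the lab's model. The object names and the predictor value of 3 are hypothetical. When you apply this to the lab data, check the units of the predictor first: if family_income is recorded in thousands of dollars, as gift_aid is, an income of $90,000 would be entered as 90.

```r
# Stand-in model on mtcars (ships with R); this is NOT the lab's model.
mpg_model <- lm(mpg ~ wt, data = mtcars)

# newdata must be a data frame whose column name matches the predictor.
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(mpg_model, newdata = new_car)
predicted_mpg

# The same prediction by hand: intercept + slope * predictor value.
by_hand <- coef(mpg_model)[["(Intercept)"]] + coef(mpg_model)[["wt"]] * 3
by_hand

# Compare the new value with the range of the predictor in the data.
# A value far outside this range means the prediction is an extrapolation.
range(mtcars$wt)
```

Checking the observed range of the predictor is a quick way to judge whether the model has any data to support a prediction at the new value.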
Team Member 4: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.
Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered “on time”.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on the STA 210 Regression Analysis course.
Click on the assignment, and you’ll be prompted to submit it.
Select your assignment repo and choose “master” for the branch.
Make sure to include the names of all group members who participated in the assignment. Click here for help on adding group members to an assignment.
Click Upload. You should receive an email to confirm that the assignment has been submitted.
Notes:
Exploratory Data Analysis | 15 |
Regression Model & Assumptions | 20 |
Using the Model | 8 |
Lab attendance & participation | 3 |
Narrative in full sentences | 2 |
Commit messages from every member | 2 |
Total | 50 |