The primary goal of today’s lab is to give you practice with some of the tools you will need to conduct regression analysis using R. An additional goal for today is for you to be introduced to your teams and practice collaborating using GitHub and RStudio.
Each of your assignments will begin with the following steps.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-02-slr. It contains the starter documents you need to complete the lab.
Click on the green Clone or download button, select Use HTTPS (this might already be selected by default, and if it is, you’ll see the text Clone with HTTPS as in the image below). Click on the clipboard icon to copy the repo URL.
Go to RStudio Cloud and into the STA 210 course workspace. Create a New Project from Git Repo. You will need to click on the down arrow next to the New Project button to see this option.
Copy and paste the URL of your assignment repo into the dialog box. Be sure “Add packages from the base project” is checked.
Click OK, and you should see the contents from your GitHub repo in the Files pane in RStudio.
There is one more piece of housekeeping we need to take care of before we get started. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.
To do so, you will use the use_git_config()
function from the usethis
package.
Type the following lines of code in the console in RStudio filling in your name and email address.
Your email address is the address tied to your GitHub account and your name should be first and last name.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
We will use the following packages in today’s lab.
library(tidyverse)
library(skimr)
library(broom)
library(rcfss)
Currently your project is called Untitled Project. Update the name of your project to be “Lab 02 - Simple Linear Regression”
Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.
Before we introduce the data, let’s warm up with a simple exercise.
Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
In today’s lab, we will analyze the scorecard
dataset from the rcfss package. This dataset contains information about 1849 colleges obtained from the Department of Education’s College Scorecard. Load the rcfss library into the global R environment and type ?scorecard
in the console to learn more about the dataset and variable definitions. Today’s analysis will focus on the following variables:
type |
Type of college (Public, Private - nonprofit, Private - for profit) |
cost |
The average annual cost of attendance, including tuition and feeds, books and supplies, and living expenses, minus the average grant/scholarship aid |
admrate |
Undergraduate admissions rate (from 0 - 100%) |
Plot a histogram to examine the distribution of admrate
. What is the shape of the distribution?
To better understand the distribution of admrate
, we would like calculate measures of center and spread of the distribution. Fill in the code below to use the skim
function to calculate summary statistics for admrate
. Report the appropriate measures of center (mean or median) and spread (standard deviation or IQR) based on the shape of the distribution from Exercise 1.
scorecard %>%
select(admrate) %>%
skim()
Plot the distribution of cost
and calculate the appropriate summary statistics. Describe the distribution of cost
(shape, center, and spread) using the plot and appropriate summary statistics.
One nice feature of the skim
function is that it provides information about the number of observations that are missing values of the variable. How many observations have missing values of admrate
? How many observations have missing values of cost
?
Later in the semester, we will techniques to deal with missing values in the data. For now, however, we will only include complete observations for the remainder of this analysis. We can use the filter
function to select only the rows that values for both cost
and admrate
.
Fill in the code below to create a new dataset called scorecard_new
that only includes observations with values for both admrate
and cost
.
__________ <- scorecard %>%
filter(!is.na(admrate),________)
Learn more about the filter
function in [Section 5.2 of R for Data Science] (https://r4ds.had.co.nz/transform.html#filter-rows-with-filter)
You will use scorecard_new
for the rest of the lab.
Create a scatterplot to display the relationship between cost
(response variable) and admrate
(explanatory variable). Use the scatterplot to describe the relationship between the two variables.
The data contains information about the type of college, and we would like to incorporate this information into the scatterplot. One way to do this is to use a different color marker for each type of college. Fill in the code below the scatterplot from the previous exercise with the marker colors based on the variable type
. Describe two new observations from this scatterplot that you didn’t see in the previous plot.
ggplot(data=scorecard_new, mapping=aes(x=admrate, y=cost, color=type)) +
_____________________
Fit a regression model to describe the relationship between a college’s admission rate and cost. Use the tidy
function to display the model.
Interpret the slope in the context of the problem. Does the intercept have a meaningful interpretation? If so, write the interpretation in the context of the problem. Otherwise, explain why the interpretation is not meaningful.
While the tidy
function is used to display the model, we can obtain a one-row summary of the model using the glance
function. Use the glance
function to get a summary of the model fit in the previous exercise. See the documentation for glance
for the syntax and a list of values output from the function.
What is the value of \(R^2\)? Interpret this value in the context of the problem. Do you think this is a “good” value of \(R^2\)? Explain.
What is the value of \(\hat{\sigma}\), the residual standard error.
What is the 95% confidence interval for the coefficient of admrate
, i.e. the slope? Interpret the interval in the context of the data.
We want to test the following hypotheses about the population slope \(\beta_1\):
\[H_0: \beta_1 = 0 \hspace{5mm} \text{versus} \hspace{5mm} H_a: \beta_1 \neq 0\]
State what the null and alternative hypotheses mean in terms of the linear relationship between admrate
and cost
.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 2!”, and push. Before you wrap up the assignment, make sure the .Rmd, .html, and .md documents are all updated on your GitHub repo.