Lab 02: Simple Linear Regression

due Wed, Jan 30 at 11:59p

The primary goal of today’s lab is to give you practice with some of the tools you will need to conduct regression analysis using R. An additional goal for today is for you to be introduced to your teams and practice collaborating using GitHub and RStudio.

Getting Started

Each of your assignments will begin with the following steps.

Clone Assignment Repo

Configure git

There is one more piece of housekeeping we need to take care of before we get started. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.

To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your name and email address.

Your email address is the address tied to your GitHub account and your name should be first and last name.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(skimr)
library(broom)
library(rcfss)

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 02 - Simple Linear Regression”

Warm up

Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.

Commiting and pushing changes:

Pulling changes:

Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.

Data

In today’s lab, we will analyze the scorecard dataset from the rcfss package. This dataset contains information about 1849 colleges obtained from the Department of Education’s College Scorecard. Load the rcfss library into the global R environment and type ?scorecard in the console to learn more about the dataset and variable definitions. Today’s analysis will focus on the following variables:

type Type of college (Public, Private - nonprofit, Private - for profit)
cost The average annual cost of attendance, including tuition and feeds, books and supplies, and living expenses, minus the average grant/scholarship aid
admrate Undergraduate admissions rate (from 0 - 100%)

Exercises

Exploratory Data Analysis

  1. Plot a histogram to examine the distribution of admrate. What is the shape of the distribution?

  2. To better understand the distribution of admrate, we would like calculate measures of center and spread of the distribution. Fill in the code below to use the skim function to calculate summary statistics for admrate. Report the appropriate measures of center (mean or median) and spread (standard deviation or IQR) based on the shape of the distribution from Exercise 1.

scorecard %>%
  select(admrate) %>%
  skim()
  1. Plot the distribution of cost and calculate the appropriate summary statistics. Describe the distribution of cost (shape, center, and spread) using the plot and appropriate summary statistics.

  2. One nice feature of the skim function is that it provides information about the number of observations that are missing values of the variable. How many observations have missing values of admrate? How many observations have missing values of cost?

  3. Later in the semester, we will techniques to deal with missing values in the data. For now, however, we will only include complete observations for the remainder of this analysis. We can use the filter function to select only the rows that values for both cost and admrate.

Fill in the code below to create a new dataset called scorecard_new that only includes observations with values for both admrate and cost.

__________ <- scorecard %>%
  filter(!is.na(admrate),________)

Learn more about the filter function in [Section 5.2 of R for Data Science] (https://r4ds.had.co.nz/transform.html#filter-rows-with-filter)

You will use scorecard_new for the rest of the lab.

  1. Create a scatterplot to display the relationship between cost (response variable) and admrate (explanatory variable). Use the scatterplot to describe the relationship between the two variables.

  2. The data contains information about the type of college, and we would like to incorporate this information into the scatterplot. One way to do this is to use a different color marker for each type of college. Fill in the code below the scatterplot from the previous exercise with the marker colors based on the variable type. Describe two new observations from this scatterplot that you didn’t see in the previous plot.

ggplot(data=scorecard_new, mapping=aes(x=admrate, y=cost, color=type)) + 
  _____________________

Simple Linear Regression

  1. Fit a regression model to describe the relationship between a college’s admission rate and cost. Use the tidy function to display the model.

  2. Interpret the slope in the context of the problem. Does the intercept have a meaningful interpretation? If so, write the interpretation in the context of the problem. Otherwise, explain why the interpretation is not meaningful.

  3. While the tidy function is used to display the model, we can obtain a one-row summary of the model using the glance function. Use the glance function to get a summary of the model fit in the previous exercise. See the documentation for glance for the syntax and a list of values output from the function.

  4. What is the value of \(R^2\)? Interpret this value in the context of the problem. Do you think this is a “good” value of \(R^2\)? Explain.

  5. What is the value of \(\hat{\sigma}\), the residual standard error.

  6. What is the 95% confidence interval for the coefficient of admrate, i.e. the slope? Interpret the interval in the context of the data.

  7. We want to test the following hypotheses about the population slope \(\beta_1\):

\[H_0: \beta_1 = 0 \hspace{5mm} \text{versus} \hspace{5mm} H_a: \beta_1 \neq 0\]

State what the null and alternative hypotheses mean in terms of the linear relationship between admrate and cost.

  1. Consider the confidence interval from Exercise 13 and the hypotheses in Exercise 14. Is the confidence interval consistent with the null or alternative hypothesis? Briefly explain.

You’re done! Commit all remaining changes, use the commit message “Done with Lab 2!”, and push. Before you wrap up the assignment, make sure the .Rmd, .html, and .md documents are all updated on your GitHub repo.