Getting Started

Each of your assignments will begin with the following steps. You saw your TA demonstrate these steps, and they’re outlined in detail here again. Going forward, each lab will start with a “Getting started” section, but the details will be sparser than they are here. You can always refer back to this lab for a detailed list of the steps involved in getting started with an assignment.

Clone Assignment Repo

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-01-intro-r. It contains the starter documents you need to complete the lab.

Click on the green Clone or download button, select Use HTTPS (this might already be selected by default, and if it is, you’ll see the text Clone with HTTPS as in the image below). Click on the clipboard icon to copy the repo URL.

Go to RStudio Cloud and into the STA 210 course workspace. Create a New Project from Git Repo. You will need to click on the down arrow next to the New Project button to see this option.

Copy and paste the URL of your assignment repo into the dialog box. Be sure “Add packages from the base project” is checked.
Click OK, and you should see the contents from your GitHub repo in the Files pane in RStudio.

Configure git

There is one more piece of housekeeping we need to take care of before we get started. Specifically, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.

To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your name and email address.

Your email address is the address tied to your GitHub account and your name should be first and last name.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

For example, mine would be

library(usethis)
use_git_config(user.name="Maria Tackett", user.email="maria.tackett@duke.edu")

If you get the error message

Error in library(usethis) : there is no package called ‘usethis’

then you need to install the usethis package. Run the following code in the console to install the package.

install.packages("usethis")

Once you run the code, your values for user.name and user.email will show in the console. If your user.name and user.email are correct, you’re good to go! Otherwise, run the code again with the necessary changes.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(readr)
library(skimr)
library(broom)

If you need to install any of the packages, you can run the code below in the console.

install.packages("tidyverse") 
install.packages("readr")
install.packages("skimr")
install.packages("broom")
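
If you are not sure which of these packages are already installed, the check can be sketched in base R as follows. The helper name missing_packages() is hypothetical (it is not part of any package), and the install line is left commented out so that nothing is installed by accident.

```r
# Packages named in this lab.
lab_packages <- c("tidyverse", "readr", "skimr", "broom")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(lab_packages)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result (character(0)) means all four packages are already available.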

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 01 - Intro R”.

The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.

Committing changes:

Then go to the Git pane in your RStudio.

If you have made changes to your Rmd file, you should see it listed here. Click on it to select it in this list and then click on Diff. This shows you the difference between the last committed state of the document and its current state that includes your changes. If you’re happy with these changes, write “Update author name” in the Commit message box and click Commit.

You don’t have to commit after every change; this would get quite cumbersome. You should consider committing states that are meaningful to you for inspection, comparison, or restoration. In the first few assignments we will tell you exactly when to commit and in some cases, what commit message to use. As the semester progresses we will let you make these decisions.

Pushing changes:

Now that you have made an update and committed this change, it’s time to push these changes to the web! Or more specifically, to your repo on GitHub. Why? So that others can see your changes. And by others, we mean the course teaching team (your repos in this course are private to you and us, only).

In order to push your changes to GitHub, click on Push. This will prompt a dialogue box where you first need to enter your user name, and then your password. This might feel cumbersome. Bear with me… We will teach you how to save your password so you don’t have to enter it every time. But for this one assignment you’ll have to enter them manually each time you push in order to gain some experience with it.

Data

Today’s data comes from the Capital Bikeshare in Washington D.C. The Capital Bikeshare is a system in which customers can rent a bike for little cost, ride it around the city, and return it to a station near their destination. You can get more information about the bikeshare on their website, https://www.capitalbikeshare.com/. We will read in the data from the file bikeshare.csv located in the data folder.

bikeshare <- read_csv("data/bikeshare.csv")
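
To see what read_csv() does without touching the lab data, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the name toy are invented for illustration; only the column names mirror the variables used later in this lab.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file (this is NOT the lab's data).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("season,temp,count",
             "1,0.25,100",
             "3,0.75,450"),
           toy_path)

# col_types = cols() lets readr guess the column types quietly.
toy <- read_csv(toy_path, col_types = cols())
print(toy)

# Show the column types readr guessed for each variable.
spec(toy)
```

Checking the guessed column types right after import is a quick way to catch a column that was read in as the wrong type.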

This dataset contains information about the number of bike rentals, environmental conditions, and other information about each day in 2011 and 2012.

Exercises

Before doing any analysis, we want to understand the basic structure of the data. One way to do this is to look at the actual dataset. Type the code below in the console to view the entire dataset.

View(bikeshare)

It is sometimes more useful to view a summary of the data structure rather than view the entire dataset. This is especially true for very large datasets, i.e. those with a very large number of observations and/or variables. We can use the glimpse() function to get a general idea about the structure of our dataset. This function can be very useful when importing data from a file such as a .csv file (like in this lab) to ensure that the data imported correctly and that we have the number of observations (rows) and variables (columns) we expect. We can also use this function to see each variable’s type (e.g. integer, character).

You can type ??glimpse in the console to learn more about the function and its syntax.

Type glimpse(bikeshare) in the console to get an overview of the bikeshare dataset.
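
As a rough illustration of what glimpse() reports, the sketch below builds a small made-up data frame (the values and the name toy are invented, not taken from bikeshare.csv) and inspects it. The functions nrow() and ncol() give the number of observations and variables directly.

```r
library(dplyr)

# A made-up data frame with three observations and three variables.
toy <- tibble(
  season = c(1L, 2L, 3L),
  temp   = c(0.20, 0.45, 0.70),
  count  = c(120L, 340L, 510L)
)

# One line per variable: its name, its type, and its first few values.
glimpse(toy)

# Number of observations (rows) and variables (columns).
nrow(toy)
ncol(toy)
```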

How many observations are in the bikeshare dataset? How many variables?

In this lab, we will focus the analysis on the following variables:

`season`	1: Winter, 2: Spring, 3: Summer, 4: Fall
`temp`	Temperature (in \(^{\circ}C\)) ÷ 41
`count`	total number of bike rentals

Before fitting any regression models, we want to do an exploratory data analysis (EDA) to summarize the main characteristics of the data. Much of the EDA is visual, which we’ll get to in the next exercise. The EDA also consists of calculating summary statistics for the variables in our dataset. It is good practice to examine any variable that may be relevant to the analysis in the EDA, since there may be variables that aren’t directly included in the regression model but are still affecting the results. To keep today’s lab manageable, we will only examine the three variables season, temp, and count.

There are many ways to calculate summary statistics for each variable, and we will use a few of them throughout the semester. For now, let’s use the skim() function to calculate basic measures of center and spread, along with a sketch of the distribution.

bikeshare %>%
  select(season,temp,count) %>%
  skim()
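
To make the quantities in the skim() output more concrete, here is a minimal sketch that computes a few of them by hand with summarise() on a made-up vector. The data and the names toy and toy_summary are invented; the column names p25 and p75 are chosen to resemble the percentile labels skim() typically prints, which may differ between skimr versions.

```r
library(dplyr)

# Five made-up values, not the lab data.
toy <- tibble(count = c(10, 20, 30, 40, 50))

toy_summary <- toy %>%
  summarise(
    mean = mean(count),                     # measure of center
    sd   = sd(count),                       # measure of spread
    p25  = unname(quantile(count, 0.25)),   # 25th percentile
    p75  = unname(quantile(count, 0.75))    # 75th percentile
  )

print(toy_summary)
```

A percentile is the value below which roughly that percentage of the observations fall.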

What is the mean number of bike rentals? About 25% of the days in the data have a count above what value?

Does it make sense to calculate measures of center and spread for the variable season? If so, explain why it makes sense. Otherwise, explain why the skim() function calculated these summary statistics for the variable season even if they don’t make sense.

This is a good place to pause and commit changes with the commit message “Added summary statistics (Ex 1 - 3)”, and push.

Visualizing Your Data

One important part of EDA is visualizing the data to get a better idea of the shape of the distribution of each variable along with the relationship between variables. There are a lot of ways to make plots in R; we will use the functions available in the ggplot2 package.

The code below is used to create a histogram to visualize the distribution of count. Modify the code by writing an informative title and label for the x-axis.

ggplot(data=bikeshare, mapping=aes(x=count)) + 
  geom_histogram() +
  labs(title="______", x="______")

Sometimes you may want to customize a plot by changing different features such as the color, marker types, etc. When plotting a histogram, one easy way to customize it is by changing the color of the bars. We’ll look at two different ways to do this.

First, using a color of your choice, include the option color="_____" inside the geom_histogram() function. Your code will look similar to this. Be sure to also include an informative title and label for the x-axis.

ggplot(data=bikeshare, mapping=aes(x=count)) + 
  geom_histogram(color="_______") +
  labs(title="______", x="______")

This ggplot2 quick reference has a long list of color options. You can also use HTML color codes.

Next, instead of color="_____", use fill="______" inside of the geom_histogram() function and put the color of your choice inside the blank. You can use the same color as before or use a new one.

What is the difference in the two plots? In other words, what is the difference in the way color is implemented when using color versus fill?

Describe the distribution of count. Your description should include comments about the shape, center, spread, and any potential outliers. You should use the histogram and the summary statistics from Exercise 2 in your description.

This is another good place to pause and commit changes with the commit message “Added data visualization of count (Ex 3 - 6)”, and push.

Now that we’ve examined the variables individually, we want to look at the relationship between the variables. To make interpretation easier, we will use the mutate() function to create a new variable called temp_c that is calculated as temp * 41. We will use temp_c for the remainder of the analysis, so the temperature can be discussed in terms of degrees Celsius.

bikeshare <- bikeshare %>%
  mutate(temp_c = temp * 41)
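
The same rescaling can be checked on a few made-up values. In this sketch the data frame toy is invented for illustration; it only shows that mutate() adds a new column and leaves the original temp column in place.

```r
library(dplyr)

# Made-up scaled temperatures, not the lab data.
toy <- tibble(temp = c(0, 0.5, 1))

# Multiply by 41 to undo the scaling and recover degrees Celsius.
toy <- toy %>%
  mutate(temp_c = temp * 41)

print(toy)
```

A scaled value of 0.5, for example, corresponds to 20.5 degrees Celsius.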

Complete the code below to make a scatterplot of the number of bike rentals versus the temperature.

ggplot(data=bikeshare, mapping=aes(x=temp_c,y=count)) +
  ___________________

https://ggplot2.tidyverse.org/ is a great resource as you learn ggplot(). Click Reference in the top right corner to see a list of the various plot types available in the ggplot2 package.

Describe the relationship between the temperature and the number of bike rentals.

The temperature and number of bike rentals vary greatly depending on the season. Therefore, we would like to create a separate scatterplot of count versus temp_c for each season. To do so, we will use the facet_wrap() function, faceting by season. Recall from Exercise 2 that season is currently stored as an integer. We need to change it to a factor variable type before using it in the facet_wrap() function (you could also change it to a character variable).

bikeshare <- bikeshare %>%
  mutate(season = as.factor(season))

ggplot(data=bikeshare, mapping=aes(x=temp_c,y=count)) + 
  geom_point() +
  labs(title = "Number of Bike Rentals vs. Temperature", 
       subtitle="Faceted by Season", 
       x = "Temperature (Celsius)", 
       y = "Number of Bike Rentals") +
  facet_wrap(~season)
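
As an optional variation, a factor can also carry readable labels, which would make the facet titles show season names instead of the codes 1 to 4. The sketch below uses a made-up vector (toy_season) and the season coding listed earlier in this lab. It is not what the lab code does: if you relabeled season this way in bikeshare, later code that compares season to "1" would need to compare against "Winter" instead.

```r
# Made-up season codes, not the lab data.
toy_season <- c(1L, 3L, 2L, 4L, 1L)

# Map codes 1-4 to names, following the coding given in the variable list.
season_labeled <- factor(
  toy_season,
  levels = 1:4,
  labels = c("Winter", "Spring", "Summer", "Fall")
)

levels(season_labeled)
table(season_labeled)
```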

For which season does the linear relationship between the temperature and the number of bike rentals appear to be the strongest?

This is another good place to pause and commit changes with the commit message “Added visualization of count vs. temperature (Ex 7 - 8)”, and push.

Simple Linear Regression

We want to fit a least-squares regression using the temperature (temp_c) to explain variation in the number of bike rentals (count) in the winter season. We can use the filter() function to create a subset from the data that only includes days during the winter. The <- assigns the name winter_data to our subset.

winter_data <- bikeshare %>%
  filter(season=="1")

We will use winter_data for the remainder of the lab.
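
To see how filter() behaves with a factor, here is a small sketch on made-up data (the names toy and toy_winter are invented). Because season is a factor, its levels are compared as text, which is why the value is written in quotes.

```r
library(dplyr)

# Made-up data: two days coded as season 1, and one each for seasons 2 and 3.
toy <- tibble(
  season = factor(c(1, 1, 2, 3)),
  count  = c(100, 150, 300, 500)
)

# Keep only the rows whose season level is "1".
toy_winter <- toy %>%
  filter(season == "1")

print(toy_winter)
nrow(toy_winter)
```

Comparing the number of rows before and after filtering is a quick sanity check that the subset is what you expect.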

Fit a simple linear regression model using the lm() function; assign it the name winter_model. Replace X, Y, and my.data in the code below with the appropriate values.

winter_model <- lm(Y ~ X, data=my.data)
tidy(winter_model) # display the model output
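
For a sense of what the output looks like, the following sketch fits a simple linear regression on R's built-in mtcars dataset, which is unrelated to the bikeshare data and used here only for illustration. The names toy_model and toy_coefs are invented.

```r
library(broom)

# Regress fuel efficiency (mpg) on weight (wt) using a built-in dataset.
toy_model <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per model term, with the estimate, standard error,
# test statistic, and p-value in separate columns.
toy_coefs <- tidy(toy_model)
print(toy_coefs)
```

The row labeled (Intercept) holds the estimated intercept, and the row named after the predictor holds the estimated slope.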

Interpret the slope.

Does it make sense to interpret the intercept? If so, write the interpretation of the intercept. Otherwise, explain why not.

We conclude by checking the assumptions for regression. We use the mutate() function to add a new variable called resid that is the residual for each observation in winter_data.

winter_data <- winter_data %>%
  mutate(resid = residuals(winter_model))
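
As an aside, the broom package also offers augment(), which returns the data used to fit the model together with fitted values and residuals. The sketch below uses the built-in mtcars dataset rather than the lab data and is only meant to show that the .resid column matches what residuals() returns. The lab itself uses the mutate() approach above, and the names toy_model and toy_aug are invented.

```r
library(broom)

# Fit a model on a built-in dataset (not the bikeshare data).
toy_model <- lm(mpg ~ wt, data = mtcars)

# augment() adds columns such as .fitted and .resid to the model data.
toy_aug <- augment(toy_model)

# The .resid column should agree with residuals() from base R.
same_residuals <- isTRUE(all.equal(unname(toy_aug$.resid),
                                   unname(residuals(toy_model))))
print(same_residuals)
```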

The code for the plot of the residuals vs. the predictor variable and for the Normal QQ plot is below. In addition to these plots, write the code to make a histogram of the residuals. You can reuse code from a previous exercise to plot the histogram.

ggplot(data=winter_data, mapping=aes(x=temp_c,y=resid)) +
  geom_point() + 
  geom_hline(yintercept=0,color="red")+
  labs(title="Residuals vs. Temperature",
       x="Temperature", 
       y="Residuals")

ggplot(data=winter_data, mapping=aes(sample=resid)) +
  stat_qq() + 
  stat_qq_line()+
  labs(title="Normal QQ Plot of Residuals")

Based on the plots of the residuals and the scatterplot, are the linearity, normality, and constant variance assumptions met? Briefly explain.

Is the independence assumption met based on the description of the data? Briefly explain.

Optional: Create a plot that could be used to help you assess the independence assumption.

Throughout the semester, we will learn various methods to deal with any violations in regression assumptions. For now, we will just note them.

You’re done! Commit all remaining changes, use the commit message “Done with Lab 1!”, and push. Before you wrap up the assignment, make sure all documents are updated on your GitHub repo.

Lab 01: Introduction to R

due Wed, Jan 23 at 11:59p

Introduction

Topics covered in this lab: