R is the name of the programming language itself and RStudio is a convenient interface.
The main goal of this lab is to introduce you to R and RStudio, which we will be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions.
git is a version control system (like “Track Changes” features from Microsoft Word but more powerful) and GitHub is the home for your Git-based projects on the internet (like DropBox but much better).
An additional goal is to introduce you to git and GitHub, which is the collaboration and version control system that we will be using throughout the course.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer and statistician. If you’re new to R, you should begin by building some basic fluency in R. Today’s lab will focus on fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands. Starting next week, the labs will focus on concepts more specific to regression analysis.
To make versioning simpler, today’s lab is individual. This will give you a chance to become more familiar with git and GitHub. Next week you’ll learn about collaborating on GitHub and will produce a single lab report as a team.
Each of your assignments will begin with the following steps. You saw your TA demonstrate these steps, and they’re outlined in detail here again. Going forward each lab will start with a “Getting started” section but details will be a bit more sparse than this. You can always refer back to this lab for a detailed list of the steps involved for getting started with an assignment.
Copy and paste the URL of your assignment repo into the dialog box. Be sure “Add packages from the base project” is checked.
Click OK, and you should see the contents from your GitHub repo in the Files pane in RStudio.
There is one more piece of housekeeping we need to take care of before we get started. Specifically, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.
To do so, you will use the use_git_config()
function from the usethis
package.
Type the following lines of code in the console in RStudio filling in your name and email address.
Your email address is the address tied to your GitHub account and your name should be first and last name.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
For example, mine would be
library(usethis)
use_git_config(user.name="Maria Tackett", user.email="maria.tackett@duke.edu")
If you get the error message
Error in library(usethis) : there is no package called ‘usethis’
then you need to install the usethis
package. Run the following code in the console to install the package.
install.package("usethis")
Once you run the code, your values for user.name
and user.email
will show in the console. If your user.name
and user.email
are correct, you’re good to go! Otherwise, run the code again with the necessary changes.
We will use the following packages in today’s lab.
library(tidyverse)
library(readr)
library(skimr)
library(broom)
If you need to install any of the packages, you can run the code below in the console.
install.packages("tidyverse")
install.packages("readr")
install.packages("skimr")
install.packages("broom")
Before we introduce the data, let’s warm up with some simple exercises.
Currently your project is called Untitled Project. Update the name of your project to be “Lab 01 - Intro R”.
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
Then go to the Git pane in your RStudio.
If you have made changes to your Rmd file, you should see it listed here. Click on it to select it in this list and then click on Diff. This shows you the difference between the last committed state of the document and its current state that includes your changes. If you’re happy with these changes, write “Update author name” in the Commit message box and click Commit.
You don’t have to commit after every change, this would get quite cumbersome. You should consider committing states that are meaningful to you for inspection, comparison, or restoration. In the first few assignments we will tell you exactly when to commit and in some cases, what commit message to use. As the semester progresses we will let you make these decisions.
Now that you have made an update and committed this change, it’s time to push these changes to the web! Or more specifically, to your repo on GitHub. Why? So that others can see your changes. And by others, we mean the course teaching team (your repos in this course are private to you and us, only).
In order to push your changes to GitHub, click on Push. This will prompt a dialogue box where you first need to enter your user name, and then your password. This might feel cumbersome. Bear with me… We will teach you how to save your password so you don’t have to enter it every time. But for this one assignment you’ll have to manually enter each time you push in order to gain some experience with it.
Today’s data comes from the Capital Bikeshare in Washington D.C. The Capital Bikeshare is a system in which customers can rent a bike for little cost, ride it around the city, and return it to a station near their destination. You can get more information about the bikeshare on their website, https://www.capitalbikeshare.com/. We will read in the data from the file bikeshare.csv located in the data folder.
bikeshare <- read_csv("data/bikeshare.csv")
This dataset contains information about the number of bike rentals, environmental conditions, and other information about the each day in 2011 and 2012.
Before doing any analysis, we want to understand the basic structure of the data. One way to do this, is to look at the actual dataset. Type the code below in the console to view the entire dataset.
View(bikeshare)
It is sometimes more useful to view a summary of the data structure rather than view the entire dataset. This is especially true for very large datasets, i.e. those with a very large number of observations and/or rows. We can use the glimpse()
function to get a general idea about the structure of our dataset. This function can be very useful when importing data from a file such as a .csv file (like in this lab) to ensure that data imported correctly and that we have the number of observations (rows) and variables (columns) we expect. We can also use this function to see each variable’s type (e.g. integer, character).
You can type ??glimpse
in the console to learn more about the function and its syntax.
glimpse(bikeshare)
in the console an overview of the bikeshare
dataset.How many observations are in the bikeshare
dataset? How many variables?
season |
1: Winter, 2: Spring, 3: Summer, 4: Fall |
temp |
Temperature (in \(^{\circ}C\)) ÷ 41 |
count |
total number of bike rentals |
Before fitting any regression models, we want to do an exploratory data analysis (EDA) to summarize the main characteristics of the data. Much of the EDA is visual, which we’ll get to in the next exercise. The EDA also consists of calculating summary statistics for the variables in our dataset. It is good practice to examine any variable that may be relevant to the analysis in the EDA, since there may be variables that aren’t directly included in the regression model but are still affecting the results. To keep today’s lab manageable, we will only examine the three variables season
, temp
, and count
.
There are many ways to calculate summary statistics for each variable, and we will use a few of them throughout the semester. For now, let’s use the skim()
function to calculate basic measures of center and spread along with get a sketch of the distribution.
bikeshare %>%
select(season,temp,count) %>%
skim()
What is the mean number of bike rentals? About 25% of the days in the data have a count
above what value?
season
? If so, explain why it makes sense. Otherwise, explain why the skim()
function calculated these summary statistics for the variable season
even if they don’t make sense.This is a good place to pause and commit changes with the commit message “Added summary statistics (Ex 1 - 3)”, and push.
ggplot2
package.The code below is used to create a histogram to visualize the distribution of count
. Modify the code by writing an informative title and label for the x-axis.
ggplot(data=bikeshare, mapping=aes(x=count)) +
geom_histogram() +
labs(title="______", x="______")
First, using a color of your choice, include the option color="_____"
inside of geom_histogram()
function. Your code will look similar this. Be sure to also include an informative title and label for the x-axis.
ggplot(data=bikeshare, mapping=aes(x=count)) +
geom_histogram(color="_______") +
labs(title="______", x="______")
This ggplot2 quick reference has a long list of color options. You can also use HTML color codes.
Next, instead of color="_____"
, use fill="______"
inside of the geom_histogram()
function and put the color of your choice inside the blank. You can use the same color as before or use a new one.
What is the difference in the two plots? In other words, what is the difference in the way color is implemented when using color
versus fill
?
count
. Your description should include comments about the shape, center, spread, and any potential outliers. You should use the histogram and the summary statistics from Exercise 2 in your description.This is a another good place to pause and commit changes with the commit message “Added data visualization of count (Ex 3 - 6)”, and push.
mutate()
function to create a new variable called temp_c
that is calculated as temp * 41
. We will use temp_c
for the remainder of the analysis, so the temperature can be discussed in terms of degrees Celsius.bikeshare <- bikeshare %>%
mutate(temp_c = temp * 41)
Complete the code below to make a scatterplot of the number of bike rentals versus the temperature.
ggplot(data=bikeshare, mapping=aes(x=temp_c,y=count)) +
___________________
https://ggplot2.tidyverse.org/ is a great resource as you learn ggplot()
. Click Reference in the top right corner to see a list of the various plot types available in the ggplot2 package.
Describe the relationship between the temperature and the number of bike rentals.
count
versus temp_c
for each season. To do so, we will use the facet_wrap()
function, faceting by season
. Recall from Exercise 2 that season
is currently stored as an integer. We need to change it to a factor variable type before using it in the facet_wrap()
function (you could also change it to a character variable).bikeshare <- bikeshare %>%
mutate(season = as.factor(season))
ggplot(data=bikeshare, mapping=aes(x=temp_c,y=count)) +
geom_point() +
labs(title = "Number of Bike Rentals vs. Temperature",
subtitle="Faceted by Season",
x = "Temperature (Celsius)",
y = "Number of Bike Rentals") +
facet_wrap(~season)
For which season does the linear relationship between the temperature and the number of bike rentals appear to be the strongest?
This is a another good place to pause and commit changes with the commit message “Added visualization of count vs. temperature (Ex 7 - 8)”, and push.
We want to fit a least-squares regression using the temperature (temp_c
) to explain variation in the number of bike rentals (count
) in the winter season. We can use the filter()
function to create a subset from the data that only includes days during the winter. The <-
assigns the name winter_data
to our subset.
winter_data <- bikeshare %>%
filter(season=="1")
We will use winter_data
for the remainder of the lab.
lm()
function; assign it the name winter_model
. Replace X, Y, and my.data in the code below with the appropriate values.winter_model <- lm(Y ~ X, data=my.data)
tidy(model) #output model
Interpret the slope.
Does it make sense to interpret the intercept? If so, write the interpretation of the intercept. Otherwise, explain why not.
mutate()
function to add a new variable called resid
that is the residual for each observation in winter_data
data.winter_data <- winter_data %>%
mutate(resid = residuals(winter_model))
The code for the residuals vs. the predictor variable and the Normal QQ plot are below. In addition to these plots, write the code to make a histogram of the residuals. You can reuse code from a previous exercise to plot the histogram.
ggplot(data=winter_data, mapping=aes(x=temp_c,y=resid)) +
geom_point() +
geom_hline(yintercept=0,color="red")+
labs(title="Residuals vs. Temperature",
x="Temperature",
y="Residuals")
ggplot(data=winter_data, mapping=aes(sample=resid)) +
stat_qq() +
stat_qq_line()+
labs(title="Normal QQ Plot of Residuals")
Based on the plots of the residuals and the scatterplot, are linearity, normality, and non-constant variance assumptions met? Briefly explain.
Is the independence assumption met based on the description of the data? Briefly explain.
Optional: Create a plot that could be used to help you assess the independence assumption.
Throughout the semester, we will learn various methods to deal with any violations in regression assumptions. For now, we will just note them.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 1!”, and push. Before you wrap up the assignment, make sure all documents are updated on your GitHub repo.