August 31, 2017

Course info

Follow up from last time…

  • Office hours:
    • Dr. Cetinkaya-Rundel: Mondays 1 - 3pm + Wednesdays by appt (213 Old Chem)
    • Kyle: Wed 9 - 10am + Fri 10 - 11am at Old Chem 211A
  • Course website:

Goals

Questions to answer by the end of class

  • What is reproducible data analysis, and why do we care?
  • What is version control, and why do we care?
  • What is R vs RStudio?
  • What is git vs GitHub (and do I need to care)?

Reproducibility

Reproducibility: who cares?

Reproducibility: why should we care?

Two-pronged approach

#1 Convince researchers to adopt a reproducible research workflow



#2 Train new researchers who don’t have any other workflow



Reproducible data analysis

  • Scriptability \(\rightarrow\) R

  • Literate programming \(\rightarrow\) R Markdown

  • Version control \(\rightarrow\) Git / GitHub

Scripting and literate programming

Donald Knuth, "Literate Programming" (1983)

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

  • These ideas have been around for years!
  • and tools for putting them into practice have also been around
  • but they have never been as accessible as the current tools

Reproducibility checklist

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit

Demo

Logging on to RStudio

Live R/RStudio demo

  • R as a calculator
2 + 2
## [1] 4
factorial(20)
## [1] 2.432902e+18
  • Working with variables
x = 2
x * 3
## [1] 6

Collaboration

How do we collaborate?

  • The statistical programming language we'll use is R

  • The software we use to interface with R is RStudio

  • But how do I get you the course materials that you can build on for your assignments?
    • Hint: I'm not going to email you documents, that would be a mess!

The complete toolkit

GitHub live demo

  • Follow the link to create a repository on GitHub

  • Connect an R project to a GitHub repository

  • Working with a local and remote repository

  • Staging, Committing, Pushing and Pulling
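
The RStudio Git pane wraps a handful of git commands. As a rough sketch, the same steps can be driven from R with system2(). The folder, file name, commit identity, and the git() helper are all made up for illustration, and the push and pull lines are commented out because this throwaway repository has no remote.

```r
# Hypothetical throwaway repository in a temporary folder (not a course repo)
repo <- tempfile("demo-repo-")
dir.create(repo)

# Small helper (an assumption of this sketch): run one git command inside `repo`
git <- function(...) {
  system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
}

git("init")
writeLines("# My analysis", file.path(repo, "README.md"))

# Stage: tell git which changes belong in the next snapshot
git("add", "README.md")

# Commit: record the snapshot locally (identity supplied inline for the demo)
git("-c", "user.name=Demo", "-c", "user.email=demo@example.com",
    "commit", "-m", shQuote("Add README"))

# Push and pull need a remote such as GitHub, so they are shown but not run:
# git("push")   # send local commits to the remote repository
# git("pull")   # bring the remote's new commits into the local repository

# Count the commits recorded so far
n_commits <- git("rev-list", "--count", "HEAD")
```

This assumes git is installed and on the search path. In class the same actions are done with the buttons in RStudio's Git pane.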

(There is just a bit more of GitHub that we'll use in this class, but for today this is enough.)

Documenting and reporting

R Markdown

  • Fully reproducible reports

  • Simple markdown syntax for text

  • Code goes in chunks

Tip: Keep the R Markdown cheat sheet and the Markdown Quick Reference (Help -> Markdown Quick Reference) handy; we'll refer to them often as the course progresses.


[Live demo – follow along]

On to data analysis

Exercises

  • Exercise 0
    • Load any necessary packages and data
  • Exercise 1
    • Visualize the relationship between life expectancy and GDP per capita in 2007 using a scatter plot.
  • Exercise 2
    • Repeat the visualization from Exercise 1, but now color the points by continent.

Step 0: Load necessary packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

  • In the following exercises we'll use the readr (for loading data), dplyr (for data wrangling), and ggplot2 (for visualization) packages.

  • To use these packages, we must first load them in our R Markdown file

library(dplyr)
library(ggplot2)
library(readr)
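
If library() fails with a "there is no package called ..." error, the package is not installed yet. A minimal sketch for checking this up front, assuming only the three package names listed above (the variable names are made up for illustration):

```r
pkgs <- c("readr", "dplyr", "ggplot2")

# TRUE for each package that R can find in the current library
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message(
    "Install first with: install.packages(c(",
    paste0('"', not_installed, '"', collapse = ", "),
    "))"
  )
}
```

Installing is a one-time step per machine, while library() has to be run in every new session or document.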

Step 1: Load data

gapminder = read_csv("https://stat.duke.edu/~mc301/data/gapminder.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )
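
The message above is readr reporting the column types it guessed. As a hedged sketch, the same specification can be passed explicitly through the col_types argument, so the types are fixed rather than guessed. The tiny CSV below is made up and written to a temporary file so the example runs without the course URL.

```r
library(readr)

# Write a tiny made-up CSV with the same columns as the gapminder file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,continent,year,lifeExp,pop,gdpPercap",
  "Aland,Europe,2007,80.1,1000000,30000.5",
  "Bland,Asia,1952,45.2,5000000,800.25"
), csv_path)

# Supply the column specification instead of letting readr guess it
toy <- read_csv(
  csv_path,
  col_types = cols(
    country = col_character(),
    continent = col_character(),
    year = col_integer(),
    lifeExp = col_double(),
    pop = col_double(),
    gdpPercap = col_double()
  )
)
```

With col_types given, readr has no guessed specification to report and the message is not printed.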

Step 2: Subset data

  • Start with the gapminder dataset

  • Filter for cases (rows) where year is equal to 2007

  • Save this new subsetted dataset as gap07

gap07 <- gapminder %>%
  filter(year == 2007)
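
The pipe (%>%) passes the object on its left as the first argument of the function on its right. A small sketch on made-up data (not the real gapminder values) showing that the piped and the nested form give the same result:

```r
library(dplyr)

# Made-up stand-in for the gapminder data
toy <- tibble(
  country = c("A", "B", "A", "B"),
  year    = c(1952L, 1952L, 2007L, 2007L),
  lifeExp = c(40, 50, 60, 70)
)

# Piped form, as used on the slide
piped <- toy %>%
  filter(year == 2007)

# Equivalent nested form: the data frame is the first argument of filter()
nested <- filter(toy, year == 2007)
```

Both objects contain only the two rows where year equals 2007.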

Step 3: Explore and visualize

Task: Visualize the relationship between gdpPercap and lifeExp.

ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()

Step 4: Dig deeper

Task: Color the points by continent.

ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

Update your analysis

What if you now wanted to change your analysis:

  • to subset for 1952

  • plot life expectancy (lifeExp) vs. population (pop)

  • and size the points by GDP per capita (gdpPercap)
    • hint: add argument size = gdpPercap to your plotting code
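
One possible way to make these three changes, sketched on a small made-up data frame so it runs on its own. The column names follow the data loaded in Step 1 (note that the GDP column is named gdpPercap); the values and the name gap52 are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in with the same column names as gapminder
toy <- tibble(
  country   = c("A", "B", "C", "A", "B", "C"),
  year      = c(1952, 1952, 1952, 2007, 2007, 2007),
  lifeExp   = c(40, 55, 68, 60, 70, 80),
  pop       = c(1e6, 5e6, 2e7, 2e6, 8e6, 3e7),
  gdpPercap = c(800, 3000, 9000, 1500, 7000, 30000)
)

# Change 1: subset for 1952 instead of 2007
gap52 <- toy %>%
  filter(year == 1952)

# Change 2: put pop on the x-axis; Change 3: map size to gdpPercap
p <- ggplot(data = gap52, aes(x = pop, y = lifeExp, size = gdpPercap)) +
  geom_point()

p
```

With the real data, replacing toy with gapminder is the only change needed, which is the point of keeping the analysis in a script.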

Version control with GitHub

Version control

  • We introduced GitHub as a platform for collaboration

  • But it's much more than that…

  • It's actually designed for version control

Why version control?

PhD Comics

Why version control?

  • Simple formal system for tracking all changes to a project

  • Time machine for your projects
    • Remove the fear of breaking things
  • Learning curve is a bit steep, but when you need it you REALLY need it



Your closest collaborator is you six months ago, but you don’t reply to emails.

– Paul Wilson, UW-Madison

Recap

Can you answer these questions?

  • What is reproducible data analysis, and why do we care?
  • What is version control, and why do we care?
  • What is R vs RStudio?
  • What is git vs GitHub (and do I need to care)?

Before next class

  • Readings for next Tuesday posted

  • A brief mini homework will be posted tonight. You'll receive a link to create your repo (just like we did today) and will complete a short data visualization task using an R Markdown file I provide.
    • Having difficulty? Ask on Slack or come to office hours!