August 31, 2017
Seizure study retracted after authors realize data got "terribly mixed"
Two-pronged approach
#1 Convince researchers to adopt a reproducible research workflow
#2 Train new researchers who don’t have any other workflow
Scriptability \(\rightarrow\) R
Literate programming \(\rightarrow\) R Markdown
Version control \(\rightarrow\) Git / GitHub
"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
Near-term goals:
Long-term goals:
Log on with your Net ID and password
2 + 2
## [1] 4
factorial(20)
## [1] 2.432902e+18
x = 2
x * 3
## [1] 6
The statistical programming language we'll use is R
The software we use to interface with R is RStudio
Follow the link to create a repository on GitHub
Connect an R project to a GitHub repository
Working with a local and remote repository
Staging, Committing, Pushing and Pulling
(There is just a bit more of GitHub that we'll use in this class, but for today this is enough.)
Fully reproducible reports
Simple markdown syntax for text
Code goes in chunks
Tip: Keep the R Markdown cheat sheet and the Markdown Quick Reference (Help -> Markdown Quick Reference) handy; we'll refer to them often as the course progresses.
[Live demo – follow along]
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
In the following exercises we'll use the readr (for loading data), dplyr (for data wrangling), and ggplot2 (for visualization) packages.
To use these packages, we must first load them in our R Markdown file
library(dplyr)
library(ggplot2)
library(readr)
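library() stops with an error if a package has not been installed yet. As a minimal sketch (the variable names pkgs and not_installed are made up for illustration), you could check which of the three packages are missing before loading them:

```r
# Packages used in the exercises
pkgs <- c("readr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  message("Install these first with install.packages(): ",
          paste(not_installed, collapse = ", "))
}
```

Installing is a one-time step per computer; loading with library() is needed in every new R session or R Markdown document.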
gapminder = read_csv("https://stat.duke.edu/~mc301/data/gapminder.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )
Start with the gapminder dataset
Filter for cases (rows) where year is equal to 2007
Save this new subsetted dataset as gap07
gap07 <- gapminder %>% filter(year == 2007)
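The pipe operator %>% can be read as "then": take the data, then filter it. A small sketch on a made-up data frame (toy and its values are hypothetical, not taken from gapminder) suggests how the piped form relates to an ordinary function call:

```r
library(dplyr)

# Hypothetical toy data (made-up values, not from gapminder)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002L, 2007L, 2002L, 2007L),
  lifeExp = c(70.1, 71.3, 55.2, 56.9)
)

# Piped form, in the same style as the gap07 line above
toy07 <- toy %>% filter(year == 2007)

# Direct call: the pipe passes `toy` as the first argument to filter()
toy07_direct <- filter(toy, year == 2007)

# Compare the two results
identical(toy07, toy07_direct)
```

Both forms keep only the rows where year equals 2007; the piped form tends to read more naturally once several steps are chained together.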
Task: Visualize the relationship between gdpPercap and lifeExp.
ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp)) + geom_point()
Task: Color the points by continent.
ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point()
What if you wanted to now change your analysis
to subset for 1952
plot life expectancy (lifeExp) vs. population (pop)
and size the points by GDP per capita (gdpPercap)
hint: add size = gdpPercap to your plotting code
We introduced GitHub as a platform for collaboration
But it's much more than that…
It's actually designed for version control
Simple formal system for tracking all changes to a project
Learning curve is a bit steep, but when you need it you REALLY need it
Your closest collaborator is you six months ago, but you don’t reply to emails.
– Paul Wilson, UW-Madison
Readings for next Tuesday posted