August 27, 2015
In May 2015, Science retracted a study, published just five months earlier, of how canvassers can sway people's opinions about gay marriage.
Science Editor-in-Chief Marcia McNutt:
Original survey data not made available for independent reproduction of results.
Survey incentives misrepresented.
Sponsorship statement false.
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
The methods we'll discuss today can't prevent fraud like this, but they can make such problems easier to discover.
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:
"The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness."
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
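The merge errors described in the retractions above are the kind of problem a scripted workflow can surface early. The following is a minimal sketch in base R, assuming two hypothetical tables (`labs` and `survey`) that share an `id` column. All names and values are invented for illustration and are not taken from any of the retracted studies.

```r
# Toy data, invented purely for illustration (not from any study).
labs <- data.frame(
  id        = c("H1-001", "H1-002", "H2-001"),
  lab_value = c(1.2, 3.4, 2.2),
  stringsAsFactors = FALSE
)
survey <- data.frame(
  id    = c("H2-001", "H1-001", "H1-002"),
  score = c(7, 3, 5),
  stringsAsFactors = FALSE
)

# Check the key before merging: IDs must be unique within each table
# and the two tables must cover the same set of IDs.
stopifnot(!anyDuplicated(labs$id), !anyDuplicated(survey$id))
stopifnot(setequal(labs$id, survey$id))

# Merge on the explicit key rather than relying on row order.
merged <- merge(labs, survey, by = "id")

# A one-to-one merge should neither drop nor duplicate rows.
stopifnot(nrow(merged) == nrow(labs))
```

Because the checks live in the script, they run again every time the analysis is rerun, so a mismatch in identification codes stops the analysis instead of silently changing the results.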
#1 Convince researchers to adopt a reproducible research workflow
#2 Train new researchers who don’t have any other workflow
Scriptability \(\rightarrow\) R
Literate programming \(\rightarrow\) R Markdown
Version control \(\rightarrow\) Git / GitHub
"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
Go to gort.stat.duke.edu:8787
Log on with your Net ID and password
2 + 2
## [1] 4
factorial(20)
## [1] 2.432902e+18
x <- 2
x * 3
## [1] 6
intro_demo
Go to RStudio
Note for the future: Each course component you work on (an application exercise, a homework assignment, project, exam, etc.) should be its own repository, and should be fully contained in a folder inside the folder sta112.
On GitHub (on the web), edit the README document and commit it with a message describing what you did.
As you work in teams you will run into merge conflicts; learning how to resolve them properly will be very important.
Fully reproducible reports
Simple markdown syntax for text
Code goes in chunks
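To make the three points above concrete, here is a small sketch that writes a minimal R Markdown file from R and prints it. The file name, title, and chunk label are hypothetical, and the chunk fence is built with `strrep()` only so that this example can itself be shown inside a code block.

```r
# Three backticks, built programmatically so they do not close this block.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Intro demo\"",          # hypothetical title
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text explaining the analysis.",
  "",
  paste0(fence, "{r simple-sum}"),  # start of a code chunk
  "2 + 2",
  fence                             # end of the code chunk
)

# Write the document to a temporary location.
rmd_path <- file.path(tempdir(), "intro_demo.Rmd")
writeLines(rmd_lines, rmd_path)

# Print the file to see text and code side by side.
cat(readLines(rmd_path), sep = "\n")

# To render the report (assumes the rmarkdown package is installed):
# rmarkdown::render(rmd_path)
```

In RStudio you would normally create such a file from the menu and press Knit; the sketch only shows what the file contains.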
Tip: Keep the Markdown cheat sheet handy; we'll refer to it often as the course progresses.
[Live demo – follow along]
Visualize the relationship between life expectancy and GDP per capita across countries in 2007. Also make a plot
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
In the following exercises we'll use the dplyr (for data wrangling) and ggplot2 (for visualization) packages.
Load these packages in your markdown file
library(dplyr)
library(ggplot2)
gapminder <- read.csv("https://stat.duke.edu/~mc301/data/gapminder.csv")
Start with the gapminder dataset
Filter for cases (rows) where year is equal to 2007
Save this new subsetted dataset as gap07
gap07 <- gapminder %>% filter(year == 2007)
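If the pipe is new to you, it may help to see that `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below uses a tiny data frame with invented values (not the real gapminder data) so that it runs without a download; the names `toy`, `toy07`, and `toy07_nopipe` are hypothetical.

```r
library(dplyr)

# Toy rows invented for illustration; the real gapminder data has many more.
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 2007, 1952, 2007),
  lifeExp = c(40, 60, 50, 70)
)

# Piped version: start with the data, then filter.
toy07 <- toy %>% filter(year == 2007)

# The same call written without the pipe.
toy07_nopipe <- filter(toy, year == 2007)

# Both give the same two rows.
identical(toy07, toy07_nopipe)
```

The piped form reads in the same order as the steps listed above, which is why it is used throughout these exercises.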
Task: Visualize the relationship between gdpPercap and lifeExp.
qplot(x = gdpPercap, y = lifeExp, data = gap07)
Task: Color the points by continent.
qplot(x = gdpPercap, y = lifeExp, color = continent, data = gap07)
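Note that `qplot()` has been deprecated since ggplot2 3.4.0, so on a recent installation the calls above may produce a warning. A rough equivalent using `ggplot()` and `geom_point()` is sketched below on a small data frame with invented values; `toy07` is a hypothetical stand-in for `gap07`.

```r
library(ggplot2)

# Toy values invented for illustration only.
toy07 <- data.frame(
  gdpPercap = c(1000, 5000, 20000, 40000),
  lifeExp   = c(50, 62, 74, 80),
  continent = c("Africa", "Asia", "Americas", "Europe")
)

# Map variables to aesthetics, then add a layer of points.
p <- ggplot(toy07, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

# Typing p at the console (or calling print(p)) draws the plot.
```

With the real data, replacing `toy07` with `gap07` should give the same kind of plot as the `qplot()` call.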
Stage
Commit (with a message)
Push
What if you wanted to now change your analysis to:
subset for 1952
plot life expectancy (lifeExp) vs. population (pop)
size the points by GDP per capita (gdpPercap): add size = gdpPercap to your plotting code
Once you're done, commit and push all your changes with a meaningful message.