August 30, 2016

Syllabus

General info

Grading

  • Grading breakdown:
    • Participation - 10%
    • Application Exercises - 10%
    • Homework - 40%
    • Midterm 1 - 15%
    • Midterm 2 - 15%
    • Final project - 10%
  • Class attendance is a firm expectation; frequent absences or tardiness will be considered a legitimate cause for grade reduction.

  • Letter grades will be curved; exact cutoffs will be determined after the final exam.

  • The more evidence there is that the class has mastered the material, the more generous the curve will be.

Class meetings

  • Interactive

  • Learn-by-doing

  • Bring your laptop to class every day

Teams

  • Short survey to gauge your previous exposure to material relevant to the course.

  • Teams of 3-5 students for in-class activities, homework, and the project.

Project

  • Larger computational tasks towards the end of the semester

  • Present results / work product to the class

  • Collaborative / fully reproducible work

  • Synthesis of what you've been taught, but should focus on a specific area

Midterm exams

  • Two take-home midterm exams that you are expected to complete individually.

  • Complete a number of computational / analysis tasks that cover the breadth of the material presented in the class.

Academic integrity

Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;

  • I will conduct myself honorably in all my endeavors; and

  • I will act if the Standard is compromised.

Reusing / sharing code

  • A huge amount of code is available on the web with solutions to any number of problems.

  • Unless I explicitly tell you not to, you may make use of these resources (e.g. StackOverflow), but you must explicitly cite where any outside code was obtained.

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism.

  • The one exception to this rule is that you may not directly share code with another team or student in this class. You are welcome to discuss the problems together and ask for advice (unless explicitly told not to), but you may not send code to, or make use of code from, anyone else in this class.

Excused absences

  • Students who miss graded work due to a scheduled varsity trip, religious holiday or short-term illness should fill out an online NOVAP, RHoliday or short-term illness form respectively.

  • If you cannot complete an assignment on the due date because of a short-term illness, you have until noon the following day to complete it at no penalty; after that, the regular late work policy applies.

  • If you are faced with a personal or family emergency or a long-range or chronic health condition that interferes with your ability to attend or complete classes, you should contact your academic dean's office. See more information on policies surrounding these conditions at https://trinity.duke.edu/undergraduate/academic-policies/personal-emergencies. Your academic dean can also provide more information.

Late work policy

  • late, but same day: -10%

  • late, next day: -20%

  • 2 days or later: no credit

Other Policies

  • Please refrain from texting or using your computer for anything other than coursework during class

  • You must be in class on any day when you're scheduled to present; there are no make-ups for presentations

  • Regrade requests must be made within 3 days of when the assignment is returned, and must be submitted in writing

  • Use of disallowed materials during the take-home exams will not be tolerated

Reproducibility: who cares?

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015, Science retracted a study, published just 5 months earlier, of how canvassers can sway people's opinions about gay marriage.

  • Science Editor-in-Chief Marcia McNutt:
    • Original survey data not made available for independent reproduction of results.
    • Survey incentives misrepresented.
    • Sponsorship statement false.

  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.

Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent

Seizure study retracted after authors realize data got "terribly mixed"

From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:

"The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness."

Source: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/

Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.

  • Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].

  • Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/
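
A scripted merge makes this kind of identification-code problem easier to catch than a manual spreadsheet merge. The following is only a minimal sketch in base R: the data frames `lab` and `survey`, their columns, and all values are hypothetical and are not the authors' data or method.

```r
# Hypothetical toy tables; every value here is made up for illustration.
lab <- data.frame(id = c(1, 2, 3, 4), il6 = c(1.2, 0.8, 2.5, 1.9))
survey <- data.frame(id = c(1, 2, 3, 5),
                     depressed = c(TRUE, FALSE, TRUE, FALSE))

# Check the keys before merging: each ID should appear once per table.
stopifnot(!anyDuplicated(lab$id), !anyDuplicated(survey$id))

# IDs found in only one table are dropped by the default (inner) merge,
# so record them explicitly instead of losing them silently.
unmatched_lab <- setdiff(lab$id, survey$id)
unmatched_survey <- setdiff(survey$id, lab$id)

merged <- merge(lab, survey, by = "id")

# Compare the row counts before and after the merge.
c(lab = nrow(lab), survey = nrow(survey), merged = nrow(merged))
```

Because the checks live in the script, they run again every time the analysis is rerun.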

Reproducibility: why should we care?

Two-pronged approach

#1 Convince researchers to adopt a reproducible research workflow



#2 Train new researchers who don’t have any other workflow



Reproducible data analysis

  • Scriptability → R

  • Literate programming → R Markdown

  • Version control → Git / GitHub

Scripting and literate programming

Donald Knuth "Literate Programming (1983)"

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

  • These ideas have been around for years!
  • and tools for putting them to practice have also been around
  • but they have never been as accessible as the current tools

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit


Demo

Logging on to RStudio

Live R/RStudio demo

  • R as a calculator
2 + 2
## [1] 4
factorial(20)
## [1] 2.432902e+18
  • Working with variables
x = 2
x * 3
## [1] 6

Documenting and reporting

R Markdown

  • Fully reproducible reports

  • Simple markdown syntax for text

  • Code goes in chunks

Tip: Keep the Markdown cheat sheet handy; we'll refer to it often as the course progresses.
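
To make "code goes in chunks" concrete, here is a minimal sketch that writes a tiny R Markdown file from R. The file name, title, and contents are hypothetical, and the rendering step assumes the rmarkdown package is installed, so it is left commented out.

```r
# Build the chunk delimiter (three backticks) programmatically.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  'title: "Hypothetical example report"',
  "output: html_document",
  "---",
  "",
  "Plain text is written in *markdown*.",
  "",
  paste0(fence, "{r}"),   # opens an R code chunk
  "2 + 2",
  fence                   # closes the chunk
)

# Write the document to a temporary location.
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Knitting requires the rmarkdown package; uncomment to render:
# rmarkdown::render(rmd_path)
```

In RStudio the same file would normally be created through the editor and rendered with the Knit button.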


[Live demo – follow along]

On to data analysis

Exercises

  • Exercise 0
    • Load any necessary packages and data
  • Exercise 1
    • Visualize the relationship between life expectancy and GDP per capita in 2007 using a scatter plot.
  • Exercise 2
    • Repeat the visualization from Exercise 1, but now color the points by continent.

Step 0: Load necessary packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

  • In the following exercises we'll use the dplyr (for data wrangling) and ggplot2 (for visualization) packages.

  • To use these packages, we must first load them in our R Markdown file

library(dplyr)
library(ggplot2)
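
If either package is not installed, `library()` stops with an error. A minimal sketch of checking first is shown below; the variable names are my own, and installing via `install.packages()` is one option rather than a course requirement.

```r
pkgs <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE or FALSE instead of raising an error.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "),
          ". Try install.packages() first.")
}
```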

Step 1: Load data

gapminder = read.csv("https://stat.duke.edu/~mc301/data/gapminder.csv")

Step 2: Subset data

  • Start with the gapminder dataset

  • Filter for cases (rows) where year is equal to 2007

  • Save this new subsetted dataset as gap07

gap07 = filter(gapminder, year == 2007)
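
To see what `filter()` does without downloading the data, here is a small sketch on a made-up data frame. `toy` and its values are hypothetical and only borrow the gapminder column names.

```r
library(dplyr)

# Hypothetical toy rows; the values are invented for illustration.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 2007, 1952, 2007),
  lifeExp   = c(41, 70, 66, 79),
  gdpPercap = c(900, 5000, 7000, 33000)
)

# Keep only the rows where the condition is TRUE.
toy07 <- filter(toy, year == 2007)

# Base R equivalent using logical row indexing.
toy07_base <- toy[toy$year == 2007, ]

nrow(toy07)
```

Both approaches keep the same rows; the dplyr version reads closer to the plain-language description of the step.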

Step 3: Explore and visualize

Task: Visualize the relationship between gdpPercap and lifeExp.

ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp)) + geom_point()

Step 4: Dig deeper

Task: Color the points by continent.

ggplot(data = gap07, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point()
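
Because a ggplot is built up in layers joined by `+`, the plot can be stored in a variable and extended later. A minimal sketch follows, using a made-up data frame `toy07` (hypothetical values) and axis labels of my own choosing.

```r
library(ggplot2)

# Hypothetical toy rows; the values are invented for illustration.
toy07 <- data.frame(
  continent = c("Africa", "Asia", "Europe"),
  lifeExp   = c(55, 70, 79),
  gdpPercap = c(1500, 5000, 33000)
)

# Store the base plot: one point per row, colored by continent.
p <- ggplot(data = toy07,
            aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

# Add readable labels as an extra layer.
p_labeled <- p +
  labs(x = "GDP per capita", y = "Life expectancy", color = "Continent")

p_labeled
```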

Update your analysis

What if you wanted to now change your analysis

  • to subset for 1952

  • plot life expectancy (lifeExp) vs. population (pop)

  • and size the points by GDP per capita (gdpPercap)
    • hint: add argument size = gdpPercap to your plotting code
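
One possible shape for the updated analysis is sketched below on made-up data, so it runs without the download. `toy` and its values are hypothetical; the column names, including `gdpPercap`, follow the earlier plotting code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows; the values are invented for illustration.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 2007, 1952, 2007),
  lifeExp   = c(41, 70, 66, 79),
  pop       = c(5e7, 9e7, 3e7, 4e7),
  gdpPercap = c(900, 5000, 7000, 33000)
)

# Subset for 1952 instead of 2007.
toy52 <- filter(toy, year == 1952)

# Life expectancy vs. population, with point size mapped to GDP per capita.
p52 <- ggplot(data = toy52,
              aes(x = pop, y = lifeExp, size = gdpPercap)) +
  geom_point()

p52
```

With the real data, only the data frame and the year would change; rerunning the document regenerates the plot.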

Homework
