class: center, middle, inverse, title-slide

# Meet the Toolkit
### Yue Jiang
### Duke University

---
class: center, middle

## Reproducible data analysis

---

## Reproducibility checklist

.question[
What does it mean for a data analysis to be "reproducible"?
]

**Near-term goals:**

- Are the tables and figures reproducible from the code and data?
- Does the code actually do what you think it does?
- In addition to what was done, is it clear **why** it was done?

<br>

**Long-term goals:**

- Can the code be used for other data?
- Can you extend the code to do other things?

---

## Toolkit

<img src="img/02/toolkit.png" width="70%" style="display: block; margin: auto;" />

- **Scriptability** `\(\rightarrow\)` R
- **Literate programming** (code, narrative, output in one place) `\(\rightarrow\)` R Markdown
- **Version control** `\(\rightarrow\)` Git / GitHub

---
class: center, middle

# The toolkit in detail

---

## What are R and RStudio?

- R is a statistical programming language
- RStudio is a convenient interface for R (an integrated development environment, IDE)
- At its simplest:<sup>*</sup>
  - R is like a car’s engine
  - RStudio is like a car’s dashboard

<img src="img/02/engine-dashboard.png" width="70%" style="display: block; margin: auto;" />

.footnote[
*Source: [Modern Dive](https://moderndive.com/)
]

---

## R essentials (a short list)

- **Functions** are (most often) verbs, followed by what they will be applied to in parentheses:

```r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

- **Columns** (variables) in data frames are accessed with `$`:

```r
dataframe$var_name
```

- **Packages** are installed once with the `install.packages` function and loaded with the `library` function once per session:

```r
install.packages("package_name")  # install once
library(package_name)             # load in every new session
```

---

## tidyverse

<img src="img/02/tidyverse-packages.png" width="60%" style="display: block; margin: auto;" />

- The [tidyverse](https://www.tidyverse.org/) is an **opinionated** collection of R packages
  designed for data science.
- All packages share an underlying philosophy and a common grammar.

.footnote[
Image from [Teaching in the Tidyverse 2020](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/)
]

---

## R Markdown

What is R Markdown?

- Fully reproducible reports -- the analysis is run from the beginning each time you knit
- [Markdown cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf)
- Code goes in chunks, delimited by three backticks; narrative goes outside of chunks
- **Remember**: the workspace of the R Markdown document is *separate* from the Console

How will we use R Markdown?

- Every assignment / lab / project / etc. is an R Markdown document
- You'll always have a template R Markdown document to start with
- The amount of scaffolding in the template will decrease over the semester

---

## Why do we need version control?

<img src="img/02/phd_comics_vc.gif" width="50%" style="display: block; margin: auto;" />

---

## What is versioning?

<img src="img/02/lego-steps-commit-messages.png" width="80%" style="display: block; margin: auto;" />

---

## Git and GitHub

- **Git** is a version control system -- like the “Track Changes” feature in Microsoft Word.
- **GitHub** is the home for your Git-based projects on the internet (like Dropbox but much better).
- There are a lot of Git commands and very few people know them all. 99% of the time you will use Git to add, commit, push, and pull.

<img src="img/02/git-github.png" width="80%" style="display: block; margin: auto;" />

---

## Git and GitHub tips

- We will be using Git and interfacing with GitHub through RStudio.
- If you Google for help, skip any methods for using Git through the command line.
- There is a great resource for working with Git and R: [happygitwithr.com](http://happygitwithr.com/). Some of its content is beyond the scope of this course, but it's a good place to look for help.

---

## Recap

Can you answer these questions?

- What is a reproducible data analysis, and why is it important?
- What is version control, and why is it important?
- What is R vs. RStudio?
- What is git vs. GitHub?

---

## The Flint Water Crisis

The Flint Water Crisis was a public health emergency that started in 2014. Before class, please read the short article linked [here](https://www2.stat.duke.edu/courses/Fall20/sta199.001/reading/flint-water-story.pdf).

<img src="img/02/flint.png" width="80%" style="display: block; margin: auto;" />

Keep the following things in mind from the article:

- The "action" level for lead is 15 parts per billion (ppb)
- The MDEQ recommended flushing of pipes for at least five minutes

---

## Integrating RStudio with GitHub

Each assignment, application exercise, and lab in this course will have an associated link used to create a private **repository** on the class organization page. Today's link is given at [https://classroom.github.com/a/qPa2wG39](https://classroom.github.com/a/qPa2wG39).

In class, we will use R Markdown to make a reproducible report regarding the Flint Water Crisis.

<img src="img/demo.png" width="40%" style="display: block; margin: auto;" />
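
---

## A first look at the action level (sketch)

As a rough preview of the kind of code the report might contain, here is a minimal base-R sketch that applies the R essentials from earlier (functions and `$` column access) to the 15 ppb action level. The data frame `samples`, its columns, and every measurement in it are hypothetical and made up for illustration; they are not the Flint data, and the in-class report may be organized differently.

```r
# Hypothetical lead measurements in ppb. These values are invented for
# illustration only and are NOT taken from the Flint data.
samples <- data.frame(
  id       = 1:5,
  lead_ppb = c(2.1, 9.3, 27.4, 0.8, 104.0)
)

# Action level for lead from the article, in ppb
action_level <- 15

# Add a column flagging samples above the action level
samples$above_action <- samples$lead_ppb > action_level

# Count and proportion of flagged samples
n_above    <- sum(samples$above_action)
prop_above <- mean(samples$above_action)

n_above
prop_above
```

Because the script starts from the raw values and recomputes everything, rerunning it (or knitting it inside an R Markdown chunk) reproduces the same numbers each time.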