--- title: "dplyr & data wrangling" author: "Colin Rundel" date: "2018-09-12" output: xaringan::moon_reader: css: "slides.css" lib_dir: libs nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{r, message=FALSE, warning=FALSE, include=FALSE} options( htmltools.dir.version = FALSE, # for blogdown width = 80, tibble.width = 80 ) htmltools::tagList(rmarkdown::html_dependency_font_awesome()) library(magrittr) ``` --- class: middle count: false # Pipes --- ## magrittr .middle[ .center[ ```{r echo=FALSE, out.width="60%"} knitr::include_graphics("imgs/magritte.jpg") ``` ```{r echo=FALSE, out.width="60%"} knitr::include_graphics("imgs/magrittr.jpeg") ``` ] ] --- ## Pipes in R You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode this would look like: ```{r eval=FALSE} park(drive(start_car(find("keys")), to="campus")) ```
Writing it out using pipes give it a more natural (and easier to read) structure: ```{r eval=FALSE} find("keys") %>% start_car() %>% drive(to="campus") %>% park() ``` --- ## Approaches All of the following are fine, it comes down to personal preference:
Nested: ```{r, eval=FALSE} h( g( f(x), y=1), z=1 ) ```
Piped: ```{r, eval=FALSE} f(x) %>% g(y=1) %>% h(z=1) ```
Intermediate: ```{r, eval=FALSE} res = f(x) res = g(res, y=1) res = h(res, z=1) ``` --- ## What about other arguments? Sometimes we want to send our results to an function argument other than first one or we want to use the previous result for multiple arguments. In these cases we can refer to the previous result using `.`. -- ```{r} data.frame(a=1:3,b=3:1) %>% lm(a~b,data=.) ``` -- ```{r} data.frame(a=1:3,b=3:1) %>% .[[1]] ``` -- ```{r} data.frame(a=1:3,b=3:1) %>% .[[length(.)]] ``` --- class: middle count: false # dplyr --- ## A Grammar of Data Manipulation dplyr is based on the concepts of functions as verbs that manipulate data frames. Single data frame functions / verbs: .small[ * `filter()`: pick rows matching criteria * `slice()`: pick rows using index(es) * `select()`: pick columns by name * `pull()`: grab a column as a vector * `rename()`: rename specific columns * `arrange()`: reorder rows * `mutate()`: add new variables * `transmute()`: create new data frame with variables * `distinct()`: filter for unique rows * `sample_n()` / `sample_frac()`: randomly sample rows * `summarise()`: reduce variables to values * ... (many more) ] --- ## dplyr rules for functions 1. First argument is *always* a data frame 2. Subsequent arguments say what to do with that data frame 3. *Always* return a data frame 4. Don't modify in place 5. Lazy evaluation magic --- ## Example Data We will demonstrate dplyr's functionality using the nycflights13 data. ```{r message=FALSE} library(dplyr) library(nycflights13) flights ``` --- ## filter() - March flights ```{r} flights %>% filter(month == 3) ``` --- ## filter() - Flights in the first 7 days of March ```{r} flights %>% filter(month == 3, day <= 7) ``` --- ## filter() - Flights to LAX *or* RDU in March ```{r} flights %>% filter(dest == "LAX" | dest == "RDU", month==3) ``` --- ## slice() - First 10 flights ```{r} flights %>% slice(1:10) ``` --- ## slice() - Last 5 flights ```{r} flights %>% slice((n()-4):n()) ``` --- ## select() - Individual Columns ```{r} flights %>% select(year, month, day) ``` --- ## select() - Exclude Columns ```{r} flights %>% select(-year, -month, -day) ``` --- ## select() - Ranges ```{r} flights %>% select(year:day) ``` --- ## select() - Exclusion Ranges ```{r} flights %>% select(-(year:day)) ``` --- class: split-50 ## select() - Matching ```{r} flights %>% select(contains("dep"), contains("arr")) ``` --- ```{r} flights %>% select(starts_with("dep"), starts_with("arr")) ``` Other helpers: `ends_with`, `matches`, `num_range`, `one_of`, `everything`, `last_col`. --- ## select_if() - Get non-numeric columns ```{r} flights %>% select_if(function(x) !is.numeric(x)) ``` --- ## pull() ```{r} names(flights) flights %>% pull("year") %>% head() flights %>% pull(1) %>% head() flights %>% pull(-1) %>% head() ``` --- ## rename() - Change column names ```{r} flights %>% rename(tail_number = tailnum) ``` --- ## select() vs. rename() .small[ ```{r} flights %>% select(tail_number = tailnum) flights %>% rename(tail_number = tailnum) ``` ] --- ## arrange() - Sort data ```{r} flights %>% filter(month==3,day==2) %>% arrange(origin, dest) ``` --- ## arrange() & desc() - Descending order ```{r} flights %>% filter(month==3,day==2) %>% arrange(desc(origin), dest) %>% select(origin, dest, tailnum) ``` --- ## mutate() - Modify columns ```{r message=FALSE} flights %>% select(year:day) %>% mutate(date = paste(year,month,day,sep="/")) ``` --- ## transmute() - Create new tibble from existing columns ```{r} flights %>% select(year:day) %>% transmute(date = paste(year,month,day,sep="/")) ``` --- ## distinct() - Find unique rows ```{r} flights %>% select(origin, dest) %>% distinct() %>% arrange(origin,dest) ``` --- ## Sampling rows .small[ ```{r} flights %>% sample_n(10) flights %>% sample_frac(0.00003) ``` ] --- ## summarise() ```{r} flights %>% summarize(n(), min(dep_delay), max(dep_delay)) ``` -- ```{r} flights %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` --- ## group_by() ```{r} flights %>% group_by(origin) ``` --- ## summarise() with group_by() ```{r} flights %>% group_by(origin) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` --- ```{r} flights %>% group_by(origin, carrier) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) %>% filter(n > 10000) ``` --- ## mutate() with group_by() ```{r} flights %>% group_by(origin) %>% mutate( n = n(), ) %>% select(origin, n) ``` --- ## Demos 1. How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?
1. What was the shortest flight out of each airport in terms of distance? In terms of duration? --- ## Exercises 1. Which plane (check the tail number) flew out of each New York airport the most?
1. Which date should you fly on if you want to have the lowest possible average departure delay? What about arrival delay?