--- title: "Package dplyr" subtitle: "Statistical Computing & Programming" author: "Shawn Santo" institute: "" date: "05-21-20" output: xaringan::moon_reader: css: "slides.css" lib_dir: libs nature: highlightStyle: github highlightLines: true countIncrementalSlides: false editor_options: chunk_output_type: console --- ```{r include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, comment = "#>", highlight = TRUE, fig.align = "center") ``` ## Supplementary materials Companion videos - [Introduction to `tidyverse`, `dplyr` and the pipe operator](https://warpwire.duke.edu/w/0cgDAA/) - [Core `dplyr` functions](https://warpwire.duke.edu/w/08gDAA/) - [More `dplyr`: `summarise()`, `group_by()`, and `case_when()`](https://warpwire.duke.edu/w/1cgDAA/) Additional resources - `dplyr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) - [dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) - [Chapter 5](https://r4ds.had.co.nz/transform.html), R for Data Science --- ## Getting started ```{r} library(tidyverse) ``` ```r ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ✔ ggplot2 3.2.1 ✔ purrr 0.3.2 ✔ tibble 2.1.3 ✔ dplyr 0.8.3 ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.4.0 ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ``` Also, load `nycflights13`. ```{r} library(nycflights13) ``` --- class: inverse, center, middle # Pipes --- ## Pipes in R Infix function `%>%` is a forward-pipe operator. It allows you to pipe an object forward into a function or call expression. You can think about the following sequence of actions - find keys, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudo code this would look like: ```{r eval=FALSE} park(drive(start_car(unlock_car(find("keys"))), to = "campus")) ```
Writing it out using pipes give it a more natural (and easier to read) structure: ```{r eval=FALSE} find("keys") %>% unlock_car() %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## Approaches All of the following are fine, it comes down to personal preference. Nested: ```{r, eval=FALSE} h(g(f(x), y = 1), z = 1) ``` Piped: ```{r, eval=FALSE} f(x) %>% g(y = 1) %>% h(z = 1) ``` Intermediate: ```{r, eval=FALSE} res <- f(x) res <- g(res, y = 1) res <- h(res, z = 1) ``` --- ## What about other arguments? By default, the object on the left-hand side of ` %>% ` is placed as the value to the first argument in the function on the right-hand side of ` %>% `. To pass the value to other arguments a `.` is used. For example, -- ```{r} data.frame(a = 1:3, b = 3:1) %>% lm(a ~ b, data = .) ``` -- ```{r} data.frame(a = 1:3, b = 3:1) %>% .[[1]] ``` -- ```{r} data.frame(a = 1:3, b = 3:1) %>% .[[length(.)]] ``` --- class: inverse, center, middle # Data wrangling with `dplyr` --- ## A grammar of data manipulation Package `dplyr` is based on the concepts of functions as verbs that manipulate data frames. Single data frame functions / verbs: .small-text[ | Function | Description | |--------------------------------|--------------------------------------| | `filter()` | pick rows matching criteria | | `slice()` | pick rows using indices | | `select()` | pick columns by name | | `pull()` | grab a column as a vector | | `rename()` | rename specific columns | | `arrange()` | reorder rows | | `mutate()` | add new variables | | `transmute()` | create new data frame with variables | | `distinct()` | filter for unique rows | | `sample_n()` / `sample_frac()` | randomly sample rows | | `summarise()` | reduce variables to values | ] ... many more --- ## `dplyr` rules 1. First argument is *always* a data frame 2. Subsequent arguments say what to do with that data frame 3. Almost always returns a data frame 4. Doesn't modify in place

Based on rules 1 and 3, it is natural to apply ` %>% ` in a sequence of `dplyr` functions for data wrangling purposes. --- ## Data We will demonstrate `dplyr`'s functionality using the `nycflights13` package. ```{r} flights ``` --- ## `filter()` - March flights ```{r} flights %>% filter(month == 3) ``` --- ## `filter()` - flights in the first 7 days of March ```{r} flights %>% filter(month == 3, day <= 7) ``` --- ## `filter()` - flights to LAX *or* RDU in March ```{r} flights %>% filter(dest == "LAX" | dest == "RDU", month == 3) ``` --- ## `slice()` - first 10 flights ```{r} flights %>% slice(1:10) ``` --- ## `slice()` - last 5 flights ```{r} flights %>% slice((n() - 4):n()) ``` --- ## `select()` - specific variables ```{r} flights %>% select(year, month, day) ``` --- ## `select()` - exclude variables ```{r} flights %>% select(-year, -month, -day) ``` --- ## `select()` - ranges ```{r} flights %>% select(year:day) ``` --- ## `select()` - exclude ranges ```{r} flights %>% select(-(year:day)) ``` --- ## `select()` - matching ```{r} flights %>% select(contains("dep"), contains("arr")) ``` --- ```{r} flights %>% select(starts_with("dep"), starts_with("arr")) ``` Other helpers: `ends_with()`, `matches()`, `num_range()`, `one_of()`, `everything()`, `last_col()` --- ## `select_if()` ```{r} flights %>% select_if(function(x) !is.numeric(x)) ``` --- ## `pull()` - grab a vector ```{r} names(flights) flights %>% pull("year") %>% head() ``` -- ```{r} flights %>% pull(1) %>% head ``` -- ```{r} flights %>% pull(-1) %>% .[1:4] ``` --- ## `arrange()` - sort data ```{r} flights %>% filter(month == 3, day == 2) %>% arrange(origin, dest) ``` --- ## `arrange()` & `desc()` By default, sorting is done in ascending order. To change that, use `desc()`. ```{r} flights %>% filter(month == 3, day == 2) %>% arrange(desc(origin), dest) %>% select(origin, dest, tailnum) ``` --- ## `mutate()` - modify columns ```{r} flights %>% select(year:day) %>% mutate(date = paste(year, month, day, sep = "/")) ``` --- ## `transmute()` - create tibble from existing columns ```{r} flights %>% transmute(date = paste(year, month, day, sep = "/")) ``` --- ## `distinct()` - find unique rows ```{r} flights %>% select(origin, dest) %>% distinct() %>% arrange(origin, dest) ``` --- ## Sampling rows .pull-left[ `sample_n()` ```{r} flights %>% select(year, origin) %>% sample_n(10) ``` ] .pull-right[ `sample_frac()` ```{r} flights %>% select(year, origin) %>% sample_frac(.00001) ``` ] --- class: inverse, center, middle # Exercises --- ## Data: Wake county parcels Parcel boundaries with address and revenue-related information for properties in Wake County, NC. http://data-wake.opendata.arcgis.com/datasets/parcels ```{r} wake <- read_csv("https://www2.stat.duke.edu/~sms185/data/econ/parcels.csv") ``` --- .tiny[ ```{r} wake <- janitor::clean_names(wake) wake ``` ] --- ## Tasks 1. Which city has the fewest land parcels in the dataset?

2. Create a tibble that shows the year a parcel was built and the total value, where all parcels are located in Apex and are more than one acre in area. Sort the result in ascending order by year built.

3. Choose a subset of five variables and 10 random rows from `wake` and save it as an object named `wake_mini`. Experiment renaming variables with `select()` and `rename()` on `wake_mini`. What is the difference between the two functions? ??? ## Solutions .solution[ ```{r} wake %>% count(city_decode) %>% arrange(n) %>% slice(1) wake %>% filter(city_decode == "APEX", calc_area > 1) %>% select(year_built, total_value_assd) %>% arrange(year_built) ``` - Function `select()` only keeps the variables you mention. - Function `rename()` keeps all variables. ] --- ## `summarise()` ```{r} flights %>% summarize(n(), min(dep_delay), max(dep_delay)) ``` -- ```{r} flights %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` --- ## Useful `summarise()` functions - Center: `mean()`, `median()` - Spread: `sd()`, `IQR()`, `mad()` - Range: `min()`, `max()`, `quantile()` - Position: `first()`, `last()`, `nth()` - Count: `n()`, `n_distinct()` - Logical: `any()`, `all()` --- ## `group_by()` ```{r eval=FALSE} flights %>% group_by(origin) ``` ```{r eval=FALSE} # A tibble: 336,776 x 19 # Groups: origin [3] #<< year month day dep_time sched_dep_time dep_delay arr_time 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 5 2013 1 1 554 600 -6 812 6 2013 1 1 554 558 -4 740 7 2013 1 1 555 600 -5 913 8 2013 1 1 557 600 -3 709 9 2013 1 1 557 600 -3 838 10 2013 1 1 558 600 -2 753 # … with 336,766 more rows, and 12 more variables: # sched_arr_time , arr_delay , carrier , # flight , tailnum , origin , dest , # air_time , distance , hour , minute , # time_hour ``` --- ## `group_by()` then `summarise()` ```{r} flights %>% group_by(origin) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` --- ```{r} flights %>% group_by(origin, carrier) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) %>% filter(n > 10000) ``` --- ## `mutate()` with `group_by()` ```{r} flights %>% group_by(origin) %>% mutate(n = n()) %>% select(origin, n) ``` --- ## Example ```{r} flights %>% group_by(origin, month) %>% summarise(n = n()) %>% slice(1) ``` --
**Why do I have a tibble with three rows?** --- ## You may need to `ungroup()` ```{r} flights %>% group_by(origin, month) %>% summarise(n = n()) %>% ungroup() %>% slice(1) ``` --- ## `case_when()` - multi-case `if_else()` Suppose we want to break the parcel size into three categories: small, medium, large. -- .tiny[ ```{r} wake %>% mutate(lot_size = case_when( deed_acres < .5 ~ "small", deed_acres < 1.5 ~ "medium", deed_acres >= 1.5 ~ "large" ) ) %>% select(deed_acres, lot_size) ``` ] --- class: inverse, center, middle # Exercises --- ## Tasks Continue to use `wake` for the following tasks. 1. Compute the mean area for each design style.
2. Compute the median sale price for each year. *Hint:* `lubridate::year()`
3. Which city with at least 1,000 parcels classified as a "Townhouse" had the highest proportion of parcels as "Townhouse"? ??? ## Solutions .solution[ ```{r} wake %>% group_by(design_style_decode) %>% summarise(mean_area = mean(calc_area)) wake %>% filter(!is.na(sale_date)) %>% mutate(year = lubridate::year(sale_date)) %>% group_by(year) %>% summarise(median_sale_price = median(totsalprice, na.rm = TRUE)) wake %>% count(city_decode, design_style_decode) %>% group_by(city_decode) %>% mutate(prop = n / sum(n)) %>% ungroup() %>% filter(design_style_decode == "Townhouse", n >= 1000) %>% arrange(desc(prop)) %>% slice(1) ``` ] --- ## References - https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html - Parcels. (2020). Data-wake.opendata.arcgis.com. Retrieved 9 February 2020, from http://data-wake.opendata.arcgis.com/datasets/parcels