class: center, middle, inverse, title-slide # Data manipulation: dplyr ## Statistical Computing & Programming ### Shawn Santo --- ## Supplementary materials Full video lecture available in Zoom Cloud Recordings Additional resources - `dplyr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) - [dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) - [Chapter 5](https://r4ds.had.co.nz/transform.html), R for Data Science --- ## Getting started ```r library(tidyverse) ``` ```r ── Attaching packages ──────────────── tidyverse 1.3.0 ── ✓ ggplot2 3.3.2 ✓ purrr 0.3.4 ✓ tibble 3.0.3 ✓ dplyr 1.0.0 ✓ tidyr 1.1.0 ✓ stringr 1.4.0 ✓ readr 1.3.1 ✓ forcats 0.5.0 ── Conflicts ────────────────────── tidyverse_conflicts() ── x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() ``` Also, load `nycflights13`. ```r library(nycflights13) ``` --- class: inverse, center, middle # Pipes --- ## Pipes in R Infix function `%>%` is a forward-pipe operator. It allows you to pipe an object forward into a function or call expression. You can think about the following sequence of actions - find keys, unlock car, start car, drive to school, park. <br/> Expressed as a set of nested functions in R pseudo code this would look like: ```r park(drive(start_car(unlock_car(find("keys"))), to = "campus")) ``` <br/> Writing it out using pipes give it a more natural (and easier to read) structure: ```r find("keys") %>% unlock_car() %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## Approaches All of the following are fine, it comes down to personal preference. Nested: ```r h(g(f(x), y = 1), z = 1) ``` Piped: ```r f(x) %>% g(y = 1) %>% h(z = 1) ``` Intermediate: ```r res <- f(x) res <- g(res, y = 1) res <- h(res, z = 1) ``` --- ## What about other arguments? By default, the object on the left-hand side of ` %>% ` is placed as the value to the first argument in the function on the right-hand side of ` %>% `. To pass the value to other arguments a `.` is used. For example, -- ```r data.frame(a = 1:3, b = 3:1) %>% lm(a ~ b, data = .) ``` ``` #> #> Call: #> lm(formula = a ~ b, data = .) #> #> Coefficients: #> (Intercept) b #> 4 -1 ``` -- ```r data.frame(a = 1:3, b = 3:1) %>% .[[1]] ``` ``` #> [1] 1 2 3 ``` -- ```r data.frame(a = 1:3, b = 3:1) %>% .[[length(.)]] ``` ``` #> [1] 3 2 1 ``` --- class: inverse, center, middle # Data wrangling with `dplyr` --- ## A grammar of data manipulation Package `dplyr` is based on the concepts of functions as verbs that manipulate data frames. Common single data frame functions / verbs: | Function | Description | Operates on | |:-------------------------------|:-------------------------------------|:---------------| | `filter()` | pick rows matching criteria | rows | | `slice()` | pick rows using indices | rows | | `arrange()` | reorder rows | rows | | `select()` | pick columns by name | columns | | `mutate()` | add new variables | columns | | `summarise()` | reduce variables to values | groups of rows | ... many more. --- ## `dplyr` rules 1. First argument is *always* a data frame 2. Subsequent arguments say what to do with that data frame 3. Almost always returns a data frame 4. Doesn't modify in place <br/><br/> Based on rules 1 and 3, it is natural to apply ` %>% ` in a sequence of `dplyr` functions for data wrangling purposes. --- ## Data We will demonstrate `dplyr`'s functionality using the `nycflights13` package. .small[ ```r flights ``` ``` #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- class: inverse, center, middle # Basics --- ## `filter()` - March flights .small[ ```r flights %>% filter(month == 3) ``` ``` #> # A tibble: 28,834 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 3 1 4 2159 125 318 56 #> 2 2013 3 1 50 2358 52 526 438 #> 3 2013 3 1 117 2245 152 223 2354 #> 4 2013 3 1 454 500 -6 633 648 #> 5 2013 3 1 505 515 -10 746 810 #> 6 2013 3 1 521 530 -9 813 827 #> 7 2013 3 1 537 540 -3 856 850 #> 8 2013 3 1 541 545 -4 1014 1023 #> 9 2013 3 1 549 600 -11 639 703 #> 10 2013 3 1 550 600 -10 747 801 #> # … with 28,824 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `filter()` - flights in the first 7 days of March .small[ ```r flights %>% filter(month == 3, day <= 7) ``` ``` #> # A tibble: 6,530 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 3 1 4 2159 125 318 56 #> 2 2013 3 1 50 2358 52 526 438 #> 3 2013 3 1 117 2245 152 223 2354 #> 4 2013 3 1 454 500 -6 633 648 #> 5 2013 3 1 505 515 -10 746 810 #> 6 2013 3 1 521 530 -9 813 827 #> 7 2013 3 1 537 540 -3 856 850 #> 8 2013 3 1 541 545 -4 1014 1023 #> 9 2013 3 1 549 600 -11 639 703 #> 10 2013 3 1 550 600 -10 747 801 #> # … with 6,520 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `filter()` - flights to LAX *or* RDU in March .small[ ```r flights %>% filter(dest == "LAX" | dest == "RDU", month == 3) ``` ``` #> # A tibble: 1,935 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 3 1 607 610 -3 832 925 #> 2 2013 3 1 608 615 -7 737 750 #> 3 2013 3 1 623 630 -7 753 810 #> 4 2013 3 1 629 632 -3 844 952 #> 5 2013 3 1 657 700 -3 953 1034 #> 6 2013 3 1 714 715 -1 939 1037 #> 7 2013 3 1 716 710 6 958 1035 #> 8 2013 3 1 727 730 -3 1007 1100 #> 9 2013 3 1 803 810 -7 923 955 #> 10 2013 3 1 823 824 -1 954 1014 #> # … with 1,925 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `slice()` - first 10 flights .small[ ```r flights %>% slice(1:10) ``` ``` #> # A tibble: 10 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `slice()` - last 5 flights .small[ ```r flights %>% slice((n() - 4):n()) ``` ``` #> # A tibble: 5 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 9 30 NA 1455 NA NA 1634 #> 2 2013 9 30 NA 2200 NA NA 2312 #> 3 2013 9 30 NA 1210 NA NA 1330 #> 4 2013 9 30 NA 1159 NA NA 1344 #> 5 2013 9 30 NA 840 NA NA 1020 #> # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `select()` - specific variables ```r flights %>% select(year, month, day) ``` ``` #> # A tibble: 336,776 x 3 #> year month day #> <int> <int> <int> #> 1 2013 1 1 #> 2 2013 1 1 #> 3 2013 1 1 #> 4 2013 1 1 #> 5 2013 1 1 #> 6 2013 1 1 #> 7 2013 1 1 #> 8 2013 1 1 #> 9 2013 1 1 #> 10 2013 1 1 #> # … with 336,766 more rows ``` --- ## `select()` - exclude variables .small[ ```r flights %>% select(-year, -month, -day) ``` ``` #> # A tibble: 336,776 x 16 #> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier #> <int> <int> <dbl> <int> <int> <dbl> <chr> #> 1 517 515 2 830 819 11 UA #> 2 533 529 4 850 830 20 UA #> 3 542 540 2 923 850 33 AA #> 4 544 545 -1 1004 1022 -18 B6 #> 5 554 600 -6 812 837 -25 DL #> 6 554 558 -4 740 728 12 UA #> 7 555 600 -5 913 854 19 B6 #> 8 557 600 -3 709 723 -14 EV #> 9 557 600 -3 838 846 -8 B6 #> 10 558 600 -2 753 745 8 AA #> # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm> ``` ] --- ## `select()` - ranges ```r flights %>% select(year:day) ``` ``` #> # A tibble: 336,776 x 3 #> year month day #> <int> <int> <int> #> 1 2013 1 1 #> 2 2013 1 1 #> 3 2013 1 1 #> 4 2013 1 1 #> 5 2013 1 1 #> 6 2013 1 1 #> 7 2013 1 1 #> 8 2013 1 1 #> 9 2013 1 1 #> 10 2013 1 1 #> # … with 336,766 more rows ``` --- ## `select()` - exclude ranges .small[ ```r flights %>% select(-(year:day)) ``` ``` #> # A tibble: 336,776 x 16 #> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier #> <int> <int> <dbl> <int> <int> <dbl> <chr> #> 1 517 515 2 830 819 11 UA #> 2 533 529 4 850 830 20 UA #> 3 542 540 2 923 850 33 AA #> 4 544 545 -1 1004 1022 -18 B6 #> 5 554 600 -6 812 837 -25 DL #> 6 554 558 -4 740 728 12 UA #> 7 555 600 -5 913 854 19 B6 #> 8 557 600 -3 709 723 -14 EV #> 9 557 600 -3 838 846 -8 B6 #> 10 558 600 -2 753 745 8 AA #> # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm> ``` ] --- ## `arrange()` - sort data .small[ ```r flights %>% filter(month == 3, day == 2) %>% arrange(origin, dest) ``` ``` #> # A tibble: 765 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 3 2 1336 1329 7 1426 1432 #> 2 2013 3 2 628 629 -1 837 849 #> 3 2013 3 2 637 640 -3 903 915 #> 4 2013 3 2 743 745 -2 945 1010 #> 5 2013 3 2 857 900 -3 1117 1126 #> 6 2013 3 2 1027 1030 -3 1234 1247 #> 7 2013 3 2 1134 1145 -11 1332 1359 #> 8 2013 3 2 1412 1415 -3 1636 1630 #> 9 2013 3 2 1633 1636 -3 1848 1908 #> 10 2013 3 2 1655 1700 -5 1857 1924 #> # … with 755 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `arrange()` & `desc()` By default, sorting is done in ascending order. To change that, use `desc()`. ```r flights %>% filter(month == 3, day == 2) %>% arrange(desc(origin), dest) %>% select(origin, dest, tailnum) ``` ``` #> # A tibble: 765 x 3 #> origin dest tailnum #> <chr> <chr> <chr> #> 1 LGA ATL N928AT #> 2 LGA ATL N623DL #> 3 LGA ATL N680DA #> 4 LGA ATL N996AT #> 5 LGA ATL N510MQ #> 6 LGA ATL N663DN #> 7 LGA ATL N942DL #> 8 LGA ATL N511MQ #> 9 LGA ATL N910DE #> 10 LGA ATL N902DE #> # … with 755 more rows ``` --- ## `mutate()` - modify columns ```r flights %>% select(year:day) %>% mutate(date = paste(year, month, day, sep = "/")) ``` ``` #> # A tibble: 336,776 x 4 #> year month day date #> <int> <int> <int> <chr> #> 1 2013 1 1 2013/1/1 #> 2 2013 1 1 2013/1/1 #> 3 2013 1 1 2013/1/1 #> 4 2013 1 1 2013/1/1 #> 5 2013 1 1 2013/1/1 #> 6 2013 1 1 2013/1/1 #> 7 2013 1 1 2013/1/1 #> 8 2013 1 1 2013/1/1 #> 9 2013 1 1 2013/1/1 #> 10 2013 1 1 2013/1/1 #> # … with 336,766 more rows ``` --- class: inverse, center, middle # Exercises --- ## Data: Wake county parcels Parcel boundaries with address and revenue-related information for properties in Wake County, NC. http://data-wake.opendata.arcgis.com/datasets/parcels .tiny[ ```r wake <- read_csv("https://www2.stat.duke.edu/~sms185/data/econ/parcels.csv") wake <- janitor::clean_names(wake) wake ``` ``` #> # A tibble: 378,020 x 59 #> objectid pin_num calc_area reid map_name owner addr1 addr2 addr3 deed_book #> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 31257151 170151… 0.596 0004… 1701 19 HAMI… 500 … RALE… <NA> 002618 #> 2 31257152 074533… 0.0660 0240… 0745 03 MEGA… 104 … CARY… <NA> 015772 #> 3 31257153 075321… 0.177 0337… 0753 17 EMMA… 3038… CARY… <NA> 012192 #> 4 31257154 174377… 0.247 0340… 1743 02 OGUN… 4608… KNIG… <NA> 016213 #> 5 31257155 185005… 0.262 0358… 1850 01 LAY,… 1328… WAKE… <NA> 014787 #> 6 31257156 174379… 0.162 0350… 1743 02 RICH… 4014… KNIG… <NA> 012662 #> 7 31257157 160727… 1.11 0178… 1607 01 GROS… 1501… RALE… <NA> 015681 #> 8 31257158 069671… 1.01 0118… 0696 04 L & … 1217… RALE… <NA> 016063 #> 9 31257159 160871… 2.74 0137… 1608 04 SOWA… 901 … RALE… <NA> 012208 #> 10 31257160 068570… 0.837 0026… 0685 04 KNIG… 7897… WILL… <NA> 016509 #> # … with 378,010 more rows, and 49 more variables: deed_page <chr>, #> # deed_date <dttm>, deed_acres <dbl>, bldg_val <dbl>, land_val <dbl>, #> # total_value_assd <dbl>, billclass <dbl>, billing_class_decode <chr>, #> # propdesc <chr>, heatedarea <dbl>, stname <chr>, stype <chr>, stpre <chr>, #> # stsuf <chr>, stnum <dbl>, stmisc <chr>, site_address <chr>, #> # full_street_name <chr>, city <chr>, city_decode <chr>, #> # planning_jurisdiction <chr>, township <chr>, township_decode <chr>, #> # firedist <dbl>, year_built <dbl>, totsalprice <dbl>, sale_date <dttm>, #> # type_and_use <chr>, type_use_decode <chr>, designstyl <chr>, #> # design_style_decode <chr>, units <dbl>, land_class <chr>, #> # land_class_decode <chr>, exemptdesc <chr>, exemptstat <chr>, #> # ownership <dbl>, activity <dbl>, `function` <dbl>, structure <dbl>, #> # site <dbl>, totstructs <dbl>, totunits <dbl>, old_parcel_number <chr>, #> # zipnum <chr>, shapearea <dbl>, shapelen <dbl>, parcel_pk <dbl>, #> # land_code <chr> ``` ] --- ## Tasks 1. Which city has the fewest land parcels in the dataset? *Hint*: `count()`. <br><br> 2. Create a tibble that shows the year a parcel was built and the total value, where all parcels are located in Apex and are more than one acre in area. Sort the result in ascending order by year built. <br><br> 3. Choose a subset of five variables and 10 rows from `wake` and save it as an object named `wake_mini`. Experiment renaming variables with `select()` and `rename()` on `wake_mini`. What is the difference between the two functions? ??? ## Solutions .solution[ ```r wake %>% count(city_decode) %>% arrange(n) %>% slice(1) ``` ``` #> # A tibble: 1 x 2 #> city_decode n #> <chr> <int> #> 1 CLAYTON 3 ``` ```r wake %>% filter(city_decode == "APEX", calc_area > 1) %>% select(year_built, total_value_assd) %>% arrange(year_built) ``` ``` #> # A tibble: 1,274 x 2 #> year_built total_value_assd #> <dbl> <dbl> #> 1 0 5250581 #> 2 1870 275017 #> 3 1890 636565 #> 4 1890 1136357 #> 5 1900 4310465 #> 6 1910 10272692 #> 7 1910 5244587 #> 8 1910 483127 #> 9 1920 1736577 #> 10 1920 703483 #> # … with 1,264 more rows ``` - Function `select()` only keeps the variables you mention. - Function `rename()` keeps all variables. ] --- ## `summarise()` ```r flights %>% summarize(n(), min(dep_delay), max(dep_delay)) ``` ``` #> # A tibble: 1 x 3 #> `n()` `min(dep_delay)` `max(dep_delay)` #> <int> <dbl> <dbl> #> 1 336776 NA NA ``` -- ```r flights %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` ``` #> # A tibble: 1 x 3 #> n min_dep_delay max_dep_delay #> <int> <dbl> <dbl> #> 1 336776 -43 1301 ``` --- ## Useful `summarise()` helper functions - Center: `mean()`, `median()` - Spread: `sd()`, `IQR()`, `mad()` - Range: `min()`, `max()`, `quantile()` - Position: `first()`, `last()`, `nth()` - Count: `n()`, `n_distinct()` - Logical: `any()`, `all()` --- ## `group_by()` ```r flights %>% group_by(origin) ``` ```r # A tibble: 336,776 x 19 *# Groups: origin [3] year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 4 2013 1 1 544 545 -1 1004 5 2013 1 1 554 600 -6 812 6 2013 1 1 554 558 -4 740 7 2013 1 1 555 600 -5 913 8 2013 1 1 557 600 -3 709 9 2013 1 1 557 600 -3 838 10 2013 1 1 558 600 -2 753 # … with 336,766 more rows, and 12 more variables: # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, # time_hour <dttm> ``` --- ## `group_by()` then `summarise()` ```r flights %>% group_by(origin) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE) ) ``` ``` #> # A tibble: 3 x 4 #> origin n min_dep_delay max_dep_delay #> <chr> <int> <dbl> <dbl> #> 1 EWR 120835 -25 1126 #> 2 JFK 111279 -43 1301 #> 3 LGA 104662 -33 911 ``` --- ```r flights %>% * group_by(origin, carrier) %>% summarize( n = n(), min_dep_delay = min(dep_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), * .groups = "drop" ) %>% filter(n > 10000) ``` ``` #> # A tibble: 10 x 5 #> origin carrier n min_dep_delay max_dep_delay #> <chr> <chr> <int> <dbl> <dbl> #> 1 EWR EV 43939 -25 548 #> 2 EWR UA 46087 -18 424 #> 3 JFK 9E 14651 -24 747 #> 4 JFK AA 13783 -15 1014 #> 5 JFK B6 42076 -43 453 #> 6 JFK DL 20701 -18 960 #> 7 LGA AA 15459 -24 803 #> 8 LGA DL 23067 -33 911 #> 9 LGA MQ 16928 -26 366 #> 10 LGA US 13136 -18 500 ``` --- ## `mutate()` with `group_by()` ```r flights %>% group_by(origin) %>% mutate(n = n()) %>% select(origin, n) ``` ``` #> # A tibble: 336,776 x 2 #> # Groups: origin [3] #> origin n #> <chr> <int> #> 1 EWR 120835 #> 2 LGA 104662 #> 3 JFK 111279 #> 4 JFK 111279 #> 5 LGA 104662 #> 6 EWR 120835 #> 7 EWR 120835 #> 8 LGA 104662 #> 9 JFK 111279 #> 10 LGA 104662 #> # … with 336,766 more rows ``` --- ## Example ```r flights %>% group_by(origin, month) %>% summarise(n = n()) %>% slice(1) ``` ``` #> # A tibble: 3 x 3 #> # Groups: origin [3] #> origin month n #> <chr> <int> <int> #> 1 EWR 1 9893 #> 2 JFK 1 9161 #> 3 LGA 1 7950 ``` -- <br/> **Why do I have a tibble with three rows?** --- ## You may need to `ungroup()` ```r flights %>% group_by(origin, month) %>% summarise(n = n()) %>% * ungroup() %>% slice(1) ``` ``` #> # A tibble: 1 x 3 #> origin month n #> <chr> <int> <int> #> 1 EWR 1 9893 ``` -- Or set the `.groups` argument in `summarise()` to `"drop"`. This is a new feature is `dplyr` version `1.0.0`. --- ## `case_when()` - multi-case `if_else()` Suppose we want to break the Wake county parcel sizes into three categories: small, medium, large. -- .small[ ```r wake %>% mutate(lot_size = case_when( deed_acres < .5 ~ "small", deed_acres < 1.5 ~ "medium", deed_acres >= 1.5 ~ "large" )) %>% select(deed_acres, lot_size) ``` ``` #> # A tibble: 378,020 x 2 #> deed_acres lot_size #> <dbl> <chr> #> 1 0.6 medium #> 2 0.07 small #> 3 0.18 small #> 4 0.25 small #> 5 0.26 small #> 6 0.16 small #> 7 1.15 medium #> 8 1.01 medium #> 9 3.61 large #> 10 1 medium #> # … with 378,010 more rows ``` ] --- class: inverse, center, middle # Exercises --- ## Tasks Continue to use `wake` for the following tasks. 1. Compute the mean area for each design style. <br/> 2. Compute the median sale price for each year. *Hint:* `lubridate::year()` <br/> 3. Which city with at least 1,000 parcels classified as a "Townhouse" had the highest proportion of parcels as "Townhouse"? ??? ## Solutions .solution[ ```r wake %>% group_by(design_style_decode) %>% summarise(mean_area = mean(calc_area)) ``` ``` #> # A tibble: 17 x 2 #> design_style_decode mean_area #> <chr> <dbl> #> 1 Cape 0.931 #> 2 Colonial 4.08 #> 3 Condo 1.71 #> 4 Contemporary 24.5 #> 5 Conventional 1.01 #> 6 Conversion 0.448 #> 7 Duplex 0.456 #> 8 Log 6.89 #> 9 MANUF MULTI 1.68 #> 10 MANUF SNGL 2.73 #> 11 MODULAR 1.60 #> 12 NONE 31.4 #> 13 Ranch 0.813 #> 14 Split Foyer 0.449 #> 15 Split Level 0.476 #> 16 Townhouse 0.0595 #> 17 <NA> 3.66 ``` ```r wake %>% filter(!is.na(sale_date)) %>% mutate(year = lubridate::year(sale_date)) %>% group_by(year) %>% summarise(median_sale_price = median(totsalprice, na.rm = TRUE)) ``` ``` #> # A tibble: 62 x 2 #> year median_sale_price #> <dbl> <dbl> #> 1 1956 500 #> 2 1958 8500 #> 3 1961 1100 #> 4 1962 5500 #> 5 1963 15900 #> 6 1964 6000 #> 7 1965 47300 #> 8 1966 6400 #> 9 1967 8800 #> 10 1968 17000 #> # … with 52 more rows ``` ```r wake %>% count(city_decode, design_style_decode) %>% group_by(city_decode) %>% mutate(prop = n / sum(n)) %>% ungroup() %>% filter(design_style_decode == "Townhouse", n >= 1000) %>% arrange(desc(prop)) %>% slice(1) ``` ``` #> # A tibble: 1 x 4 #> city_decode design_style_decode n prop #> <chr> <chr> <int> <dbl> #> 1 MORRISVILLE Townhouse 2379 0.318 ``` ] --- class: inverse, center, middle # New and advanced functionality --- ## `across()` examples Apply a function, or set of functions, to columns. .small[ ```r flights %>% mutate(across(.cols = where(is.integer), .fns = as.double)) ``` ``` #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## `across()` examples .small[ ```r flights %>% filter(month == 12, day >= 15) %>% group_by(day) %>% summarise(across(c(dep_delay, arr_delay), ~median(., na.rm = TRUE))) ``` ``` #> # A tibble: 17 x 3 #> day dep_delay arr_delay #> <int> <dbl> <dbl> #> 1 15 4 4 #> 2 16 0 0 #> 3 17 18 38 #> 4 18 4 5 #> 5 19 4 0 #> 6 20 3 3.5 #> 7 21 5 1 #> 8 22 12 10 #> 9 23 12.5 15 #> 10 24 -0.5 -5 #> 11 25 -1 -9 #> 12 26 1 -2 #> 13 27 0 -5 #> 14 28 -1 -9 #> 15 29 3 3 #> 16 30 -1 1 #> 17 31 -1 1 ``` ] --- ## `across()` examples .small[ ```r flights %>% filter(month == 12, day >= 15) %>% group_by(day) %>% summarise(across(.cols = c(dep_delay, arr_delay), .fns = ~median(., na.rm = TRUE), .names = "median_{.col}")) ``` ``` #> # A tibble: 17 x 3 #> day median_dep_delay median_arr_delay #> <int> <dbl> <dbl> #> 1 15 4 4 #> 2 16 0 0 #> 3 17 18 38 #> 4 18 4 5 #> 5 19 4 0 #> 6 20 3 3.5 #> 7 21 5 1 #> 8 22 12 10 #> 9 23 12.5 15 #> 10 24 -0.5 -5 #> 11 25 -1 -9 #> 12 26 1 -2 #> 13 27 0 -5 #> 14 28 -1 -9 #> 15 29 3 3 #> 16 30 -1 1 #> 17 31 -1 1 ``` ] --- ## `across()` examples .small[ ```r flights %>% filter(month == 12, day >= 15) %>% group_by(day) %>% summarise(across(.cols = c(dep_delay, arr_delay), .fns = list(mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE)), .names = "{.fn}_{.col}")) ``` ``` #> # A tibble: 17 x 5 #> day mean_dep_delay sd_dep_delay mean_arr_delay sd_arr_delay #> <int> <dbl> <dbl> <dbl> <dbl> #> 1 15 19.6 44.0 15.9 46.6 #> 2 16 11.7 30.9 7.37 32.7 #> 3 17 40.7 63.5 55.9 63.4 #> 4 18 14.0 31.9 11.1 33.8 #> 5 19 16.4 41.6 9.01 42.7 #> 6 20 17.3 33.5 13.8 36.6 #> 7 21 20.3 41.3 12.5 42.4 #> 8 22 29.2 46.7 23.9 48.6 #> 9 23 32.3 52.0 32.2 54.5 #> 10 24 6.77 24.5 -1.04 26.8 #> 11 25 7.55 30.0 -1.90 31.9 #> 12 26 14.4 33.6 7.17 37.3 #> 13 27 10.9 28.5 -0.149 32.5 #> 14 28 7.98 25.7 -3.26 28.9 #> 15 29 22.3 46.6 18.8 50.1 #> 16 30 10.7 32.0 10.1 35.9 #> 17 31 7.00 26.3 6.21 30.5 ``` ] --- ## `across()` examples .small[ ```r min_max <- list( min = ~min(., na.rm = TRUE), max = ~max(., na.rm = TRUE) ) flights %>% summarise(across(where(is.double), min_max)) ``` ``` #> # A tibble: 1 x 14 #> dep_delay_min dep_delay_max arr_delay_min arr_delay_max air_time_min #> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 -43 1301 -86 1272 20 #> # … with 9 more variables: air_time_max <dbl>, distance_min <dbl>, #> # distance_max <dbl>, hour_min <dbl>, hour_max <dbl>, minute_min <dbl>, #> # minute_max <dbl>, time_hour_min <dttm>, time_hour_max <dttm> ``` ] --- ## Notes on `across()` Argument `.fns` can take - A function, e.g. `mean`. - A purrr-style lambda, e.g. `~ mean(.x, na.rm = TRUE)` - A list of functions/lambdas, e.g. `list(mean = mean, n_miss = ~ sum(is.na(.x))` Prior versions of dplyr allowed you to apply a function to multiple columns in a different way: using functions with `_if`, `_at`, and `_all()` suffixes. These functions are now superseded. --- ## More `select()` - matching .small[ ```r flights %>% select(contains("dep"), contains("arr")) ``` ``` #> # A tibble: 336,776 x 7 #> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier #> <int> <int> <dbl> <int> <int> <dbl> <chr> #> 1 517 515 2 830 819 11 UA #> 2 533 529 4 850 830 20 UA #> 3 542 540 2 923 850 33 AA #> 4 544 545 -1 1004 1022 -18 B6 #> 5 554 600 -6 812 837 -25 DL #> 6 554 558 -4 740 728 12 UA #> 7 555 600 -5 913 854 19 B6 #> 8 557 600 -3 709 723 -14 EV #> 9 557 600 -3 838 846 -8 B6 #> 10 558 600 -2 753 745 8 AA #> # … with 336,766 more rows ``` ] --- ## More `select()` - matching .small[ ```r flights %>% select(starts_with("dep"), starts_with("arr")) ``` ``` #> # A tibble: 336,776 x 4 #> dep_time dep_delay arr_time arr_delay #> <int> <dbl> <int> <dbl> #> 1 517 2 830 11 #> 2 533 4 850 20 #> 3 542 2 923 33 #> 4 544 -1 1004 -18 #> 5 554 -6 812 -25 #> 6 554 -4 740 12 #> 7 555 -5 913 19 #> 8 557 -3 709 -14 #> 9 557 -3 838 -8 #> 10 558 -2 753 8 #> # … with 336,766 more rows ``` ] See `?select` for information on other helpers: - `ends_with()` - `matches()` - `num_range()` - `one_of()` - `everything()` - `last_col()` --- class: inverse, center, middle # Appendix: more `dplyr` examples --- ## Less commonly used functions These are all single data frame functions. | Function | Description | |:-------------------------------|:-------------------------------------| | `pull()` | grab a column as a vector | | `transmute()` | create new data frame with variables | | `distinct()` | filter for unique rows | | `sample_n()` / `sample_frac()` | randomly sample rows | --- ## `pull()` - grab a vector ```r names(flights) ``` ``` #> [1] "year" "month" "day" "dep_time" #> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time" #> [9] "arr_delay" "carrier" "flight" "tailnum" #> [13] "origin" "dest" "air_time" "distance" #> [17] "hour" "minute" "time_hour" ``` ```r flights %>% pull("year") %>% head() ``` ``` #> [1] 2013 2013 2013 2013 2013 2013 ``` -- ```r flights %>% pull(1) %>% head ``` ``` #> [1] 2013 2013 2013 2013 2013 2013 ``` -- ```r flights %>% pull(-1) %>% .[1:4] ``` ``` #> [1] "2013-01-01 05:00:00 EST" "2013-01-01 05:00:00 EST" #> [3] "2013-01-01 05:00:00 EST" "2013-01-01 05:00:00 EST" ``` --- ## `distinct()` - find unique rows ```r flights %>% select(origin, dest) %>% distinct() %>% arrange(origin, dest) ``` ``` #> # A tibble: 224 x 2 #> origin dest #> <chr> <chr> #> 1 EWR ALB #> 2 EWR ANC #> 3 EWR ATL #> 4 EWR AUS #> 5 EWR AVL #> 6 EWR BDL #> 7 EWR BNA #> 8 EWR BOS #> 9 EWR BQN #> 10 EWR BTV #> # … with 214 more rows ``` --- ## Sampling rows .pull-left[ `sample_n()` ```r flights %>% select(year, origin) %>% sample_n(10) ``` ``` #> # A tibble: 10 x 2 #> year origin #> <int> <chr> #> 1 2013 JFK #> 2 2013 LGA #> 3 2013 JFK #> 4 2013 LGA #> 5 2013 JFK #> 6 2013 EWR #> 7 2013 LGA #> 8 2013 LGA #> 9 2013 LGA #> 10 2013 JFK ``` ] .pull-right[ `sample_frac()` ```r flights %>% select(year, origin) %>% sample_frac(.00001) ``` ``` #> # A tibble: 3 x 2 #> year origin #> <int> <chr> #> 1 2013 EWR #> 2 2013 EWR #> 3 2013 EWR ``` ] --- ## References 1. A Grammar of Data Manipulation. (2020). https://dplyr.tidyverse.org/index.html 2. Parcels. (2020). Data-wake.opendata.arcgis.com. http://data-wake.opendata.arcgis.com/datasets/parcels