class: center, middle, inverse, title-slide # Memory and big data ## Statistical Computing & Programming ### Shawn Santo ### 06-15-20 --- ## Supplementary materials Companion videos - [Memory and atomic vectors](https://warpwire.duke.edu/w/kdcDAA/) - [Memory and lists](https://warpwire.duke.edu/w/k9cDAA/) - [Object sizes and input/output](https://warpwire.duke.edu/w/ldcDAA/) - [Working with package `multidplyr`](https://warpwire.duke.edu/w/l9cDAA/) Additional resources - [Chapter 2](https://adv-r.hadley.nz/names-values.html), Advanced R by Wickham, H. - `vroom` [vignette](https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html) - `multidplyr` [vignette](https://multidplyr.tidyverse.org/articles/multidplyr.html) - [Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes](https://www.sciencedirect.com/science/article/pii/S221457961630065X) by Schmidt, D., Chen, W., Matheson, M., & Ostrouchov, G. --- class: inverse, center, middle # Memory basics --- ## Names and values In R, a name has a value. It is not the value that has a name. For example, in ```r x <- c(-3, 4, 1) ``` the object named `x` is a reference to vector `c(-3, 4, 1)`. <br/> <center> <img src="images/name_bind1.png"> </center> --- We can see where this lives in memory with ```r library(lobstr) lobstr::obj_addr(x) ``` ``` #> [1] "0x7f8be6657298" ``` and its size with ```r lobstr::obj_size(x) ``` ``` #> 80 B ``` --- ## Copy-on-modify: atomic vectors Understanding when R creates a copy of an object will allow you to write faster code. 
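One consequence, sketched here on the assumption that `lobstr` is installed: passing a vector to a function does not copy it, and a copy is made only once the function modifies its argument. The helpers `peek()` and `poke()` are hypothetical names used only for illustration.

```r
library(lobstr)

v <- c(-3, 4, 1)

# Hypothetical helper: reads its argument without modifying it.
peek <- function(a) obj_addr(a)

# Hypothetical helper: modifies its argument, which triggers a copy.
poke <- function(a) {
  a[1] <- 0
  obj_addr(a)
}

same_when_read <- identical(peek(v), obj_addr(v))      # expected TRUE: no copy made
same_when_modified <- identical(poke(v), obj_addr(v))  # expected FALSE: copy made
v_unchanged <- identical(v, c(-3, 4, 1))               # the caller's vector is untouched
```

The copy inside `poke()` is local to the function, so the caller's `v` keeps its original values and address.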
```r x <- c(-3, 4, 1) y <- x ``` -- ```r obj_addr(x) ``` ``` #> [1] "0x7f8be65cfd28" ``` ```r obj_addr(y) ``` ``` #> [1] "0x7f8be65cfd28" ``` <center> <img src="images/name_bind2.png"> </center> --- ```r y[3] <- 100 ``` -- ```r obj_addr(x) ``` ``` #> [1] "0x7f8be65cfd28" ``` ```r obj_addr(y) ``` ``` #> [1] "0x7f8be23d9c98" ``` <center> <img src="images/name_bind3.png"> </center> --- .pull-left[ ```r x <- c(0, 1, 9) y <- x obj_addr(x) ``` ``` #> [1] "0x7f8be38c83a8" ``` ```r obj_addr(y) ``` ``` #> [1] "0x7f8be38c83a8" ``` ```r y[4] <- -100 obj_addr(x) ``` ``` #> [1] "0x7f8be38c83a8" ``` ```r obj_addr(y) ``` ``` #> [1] "0x7f8be234f5c8" ``` ] .pull-right[ <br/> <center> <img src="images/name_bind4.png"> </center> <br/><br/> <center> <img src="images/name_bind5.png"> </center> ] <br/> -- Even though only one component changed in the atomic vector `y`, R created a new object as seen by the new address in memory. --- ## Memory tracking Function `tracemem()` marks an object so that a message is printed whenever the internal code copies the object. Let's see when `x` gets copied. 
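A small aside before the walkthrough, as a sketch: `tracemem()` only works when R was built with memory profiling, which `capabilities("profmem")` reports, and it returns the traced object's address as a string. The variable names below are illustrative.

```r
v <- c(2, 4, 6)

if (capabilities("profmem")) {
  addr_v <- tracemem(v)   # address string of the traced object
  w <- v                  # second name bound to the same object, no copy
  addr_w <- tracemem(w)   # same object, so the same address string
  same_object <- identical(addr_v, addr_w)
  untracemem(v)           # stop tracing; v and w still share one object
}
```

Tracing is attached to the object, not to the name, which is why both names report the same address here.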
<br/><br/> ```r x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34) tracemem(x) ``` ``` #> [1] "<0x7f8be23dd918>" ``` -- ```r y <- x ``` -- ```r y[1] <- 0 ``` ``` #> tracemem[0x7f8be23dd918 -> 0x7f8be24c9188]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> ``` --- ```r x ``` ``` #> [1] 0 1 1 2 3 5 8 13 21 34 ``` ```r y ``` ``` #> [1] 0 1 1 2 3 5 8 13 21 34 ``` ```r c(obj_addr(x), obj_addr(y)) ``` ``` #> [1] "0x7f8be23dd918" "0x7f8be24c9188" ``` ```r x[1] <- 0 ``` ``` #> tracemem[0x7f8be23dd918 -> 0x7f8be22fbb88]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> ``` ```r ref(x) ``` ``` #> [1:0x7f8be22fbb88] <dbl> ``` ```r ref(y) ``` ``` #> [1:0x7f8be24c9188] <dbl> ``` ```r untracemem(x) ``` --- ## Copy-on-modify: lists ```r x <- list(a = 1, b = 2, c = 3) obj_addr(x) ``` ``` #> [1] "0x7f8be2382498" ``` -- ```r y <- x ``` -- ```r c(obj_addr(x), obj_addr(y)) ``` ``` #> [1] "0x7f8be2382498" "0x7f8be2382498" ``` -- ```r ref(x, y) ``` ``` #> █ [1:0x7f8be2382498] <named list> #> ├─a = [2:0x7f8be667acf8] <dbl> #> ├─b = [3:0x7f8be667acc0] <dbl> #> └─c = [4:0x7f8be667ac88] <dbl> #> #> [1:0x7f8be2382498] ``` --- ```r y$c <- 4 ``` -- ```r ref(x, y) ``` ``` #> █ [1:0x7f8be2382498] <named list> #> ├─a = [2:0x7f8be667acf8] <dbl> #> ├─b = [3:0x7f8be667acc0] <dbl> #> └─c = [4:0x7f8be667ac88] <dbl> #> #> █ [5:0x7f8be9b42818] <named list> #> ├─a = [2:0x7f8be667acf8] #> ├─b = [3:0x7f8be667acc0] #> └─c = [6:0x7f8be9b15790] <dbl> ``` --- ```r x <- list(a = 1, b = 2, c = 3) y <- x ``` -- ```r c(obj_addr(x), obj_addr(y)) ``` ``` #> [1] "0x7f8be9cc3898" "0x7f8be9cc3898" ``` -- ```r y$d <- 9 ref(x, y) ``` ``` #> █ [1:0x7f8be9cc3898] <named list> #> ├─a = [2:0x7f8be9c33110] 
<dbl> #> ├─b = [3:0x7f8be9c330d8] <dbl> #> └─c = [4:0x7f8be9c330a0] <dbl> #> #> █ [5:0x7f8be7ff0b98] <named list> #> ├─a = [2:0x7f8be9c33110] #> ├─b = [3:0x7f8be9c330d8] #> ├─c = [4:0x7f8be9c330a0] #> └─d = [6:0x7f8be9e87120] <dbl> ``` <br/> R creates a shallow copy. Shared components exist with elements `a`, `b`, and `c`. --- ## Copy-on-modify: data frames ```r library(tidyverse) x <- tibble(a = 1:3, b = 9:7) ``` -- ```r ref(x) ``` ``` #> █ [1:0x7f8beadb80c8] <tibble> #> ├─a = [2:0x7f8bea08c9e8] <int> #> └─b = [3:0x7f8bea093f20] <int> ``` -- ```r y <- x %>% mutate(b = b ^ 2) ``` -- ```r ref(x, y) ``` ``` #> █ [1:0x7f8beadb80c8] <tibble> #> ├─a = [2:0x7f8bea08c9e8] <int> #> └─b = [3:0x7f8bea093f20] <int> #> #> █ [4:0x7f8be73ee748] <tibble> #> ├─a = [2:0x7f8bea08c9e8] #> └─b = [5:0x7f8beb2871b8] <dbl> ``` --- ```r z <- x ref(x, z) ``` ``` #> █ [1:0x7f8beadb80c8] <tibble> #> ├─a = [2:0x7f8bea08c9e8] <int> #> └─b = [3:0x7f8bea093f20] <int> #> #> [1:0x7f8beadb80c8] ``` -- ```r z <- x %>% add_row(a = -1, b = -1) ``` -- ```r ref(x, z) ``` ``` #> █ [1:0x7f8beadb80c8] <tibble> #> ├─a = [2:0x7f8bea08c9e8] <int> #> └─b = [3:0x7f8bea093f20] <int> #> #> █ [4:0x7f8beb311fc8] <tibble> #> ├─a = [5:0x7f8be9305e58] <dbl> #> └─b = [6:0x7f8be9305db8] <dbl> ``` -- <br/> If you modify a column, only that column needs to be copied in memory. However, if you modify a row, the entire data frame is copied in memory. --- ## Exercise Can you diagnose what is going on below? 
```r x <- 1:10; y <- x; tracemem(x) ``` ``` #> [1] "<0x7f8be9a6b6e8>" ``` ```r c(obj_addr(x), obj_addr(y)) ``` ``` #> [1] "0x7f8be9a6b6e8" "0x7f8be9a6b6e8" ``` ```r y[1] <- 3 ``` ``` #> tracemem[0x7f8be9a6b6e8 -> 0x7f8be92a9918]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> #> tracemem[0x7f8be92a9918 -> 0x7f8be9295ee8]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> ``` --- ## Object size Object sizes can sometimes be deceiving. ```r x <- rnorm(1e6) y <- 1:1e6 z <- seq(1, 1e6, by = 1) s <- (1:1e6) / 2 ``` -- ```r c(obj_size(x), obj_size(y), obj_size(z), obj_size(s)) ``` ``` #> * 8,000,048 B #> * 680 B #> * 8,000,048 B #> * 8,000,048 B ``` --- ```r c(obj_size(c(1L)), obj_size(c(1.0))) ``` ``` #> * 56 B #> * 56 B ``` -- ```r c(obj_size(c(1L, 2L)), obj_size(as.numeric(c(1.0, 2.0)))) ``` ``` #> * 56 B #> * 64 B ``` -- ```r c(obj_size(c(1L, 2L, 3L)), obj_size(as.numeric(c(1.0, 2.0, 3.0)))) ``` ``` #> * 64 B #> * 80 B ``` -- ```r c(obj_size(integer(10000)), obj_size(numeric(10000))) ``` ``` #> * 40,048 B #> * 80,048 B ``` <br/> -- Creating a vector in R carries memory overhead. Take a look at `?Memory` if you want to dig deeper into that cost. --- ## Exercise Starting from length 0, we can see that ```r lobstr::obj_size(integer(0)) ``` ``` #> 48 B ``` ```r lobstr::obj_size(numeric(0)) ``` ``` #> 48 B ``` are both 48 bytes. Based on the results on the next slide, can you deduce how R allocates memory for integer and numeric vectors? 
--- ```r diff(sapply(0:100, function(x) lobstr::obj_size(integer(x)))) ``` ``` #> [1] 8 0 8 0 16 0 0 0 16 0 0 0 16 0 0 0 64 0 0 0 0 0 0 #> [24] 0 0 0 0 0 0 0 0 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 #> [47] 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 #> [70] 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 #> [93] 8 0 8 0 8 0 8 0 ``` ```r c(obj_size(integer(20)), obj_size(integer(22))) ``` ``` #> * 176 B #> * 176 B ``` ```r diff(sapply(0:100, function(x) lobstr::obj_size(numeric(x)))) ``` ``` #> [1] 8 8 16 0 16 0 16 0 64 0 0 0 0 0 0 0 8 8 8 8 8 8 8 #> [24] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 #> [47] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 #> [70] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 #> [93] 8 8 8 8 8 8 8 8 ``` ```r c(obj_size(numeric(10)), obj_size(numeric(14))) ``` ``` #> * 176 B #> * 176 B ``` --- class: inverse, center, middle # I/O big data --- ## Getting .small[big] data into R ```r url <- "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv" ``` .tiny[ ```r system.time({ d <- read.csv(url) }) ``` ```r user system elapsed 29.739 1.085 37.321 ``` ] -- .tiny[ ```r system.time({ d <- readr::read_csv(url) }) ``` ```r Parsed with column specification: cols( Duration = col_double(), `Start date` = col_datetime(format = ""), `End date` = col_datetime(format = ""), `Start station number` = col_double(), `Start station` = col_character(), `End station number` = col_double(), `End station` = col_character(), `Bike number` = col_character(), `Member type` = col_character() ) |================================| 100% 369 MB user system elapsed 12.773 1.727 22.327 ``` ] --- .tiny[ ```r system.time({ d <- data.table::fread(url) }) ``` ```r trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv' Content type 'text/csv' length 387899567 bytes (369.9 MB) ================================================== downloaded 369.9 MB user system elapsed 7.363 2.009 19.942 ``` ] -- .tiny[ ```r system.time({ d <- vroom::vroom(url) }) ``` ```r Observations: 
3,185,906 Variables: 9 chr [4]: Start station, End station, Bike number, Member type dbl [3]: Duration, Start station number, End station number dttm [2]: Start date, End date Call `spec()` for a copy-pastable column specification Specify the column types with `col_types` to quiet this message user system elapsed 5.873 2.361 18.606 ``` ] --- ## Getting bigger data into R ```r url <- "http://www2.stat.duke.edu/~sms185/data/bike/full.csv" ``` .tiny[ ```r system.time({ d <- read.csv(url) }) ``` ```r user system elapsed 119.472 5.037 139.214 ``` ] -- .tiny[ ```r system.time({ d <- readr::read_csv(url) }) ``` ```r Parsed with column specification: cols( Duration = col_double(), `Start date` = col_datetime(format = ""), `End date` = col_datetime(format = ""), `Start station number` = col_double(), `Start station` = col_character(), `End station number` = col_double(), `End station` = col_character(), `Bike number` = col_character(), `Member type` = col_character() ) |================================| 100% 1191 MB user system elapsed 46.845 7.607 87.425 ``` ] --- .tiny[ ```r system.time({ d <- data.table::fread(url) }) ``` ```r trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/full.csv' Content type 'text/csv' length 1249306730 bytes (1191.4 MB) ================================================== downloaded 1191.4 MB |--------------------------------------------------| |==================================================| user system elapsed 33.402 7.249 79.806 ``` ] -- .tiny[ ```r system.time({ d <- vroom::vroom(url) }) ``` ```r Observations: 10,277,677 Variables: 9 chr [4]: Start station, End station, Bike number, Member type dbl [3]: Duration, Start station number, End station number dttm [2]: Start date, End date Call `spec()` for a copy-pastable column specification Specify the column types with `col_types` to quiet this message user system elapsed 18.837 6.731 57.203 ``` ] --- ## Summary | Function | Elapsed Time (s) | |----------------------:|:------------:| | 
`vroom::vroom()` | ~57 | | `data.table::fread()` | ~80 | | `readr::read_csv()` | ~87 | | `read.csv()` | ~139 | <br/> .small[ Observations: 10,277,677 Variables: 9 ] --- class: inverse, center, middle # Package `multidplyr` --- ## Purpose and getting started `multidplyr` is a backend for dplyr that partitions a data frame across multiple cores. This will be valuable if you have to work with massive data and have the ability to parallelize. <br/> ```r devtools::install_github("tidyverse/multidplyr") library(multidplyr) ``` <br/> Since it is a backend, you will use `dplyr` verbs (functions) as before. "Due to the overhead associated with communicating between the nodes, you won’t see much performance improvement on basic dplyr verbs with less than ~10 million observations, and you may want to try `dtplyr`, which uses `data.table` instead." <br/> *`multidplyr` requires R 3.5 or greater* --- ## Read multiple data sets Create a cluster: ```r clust <- multidplyr::new_cluster(3) ``` ```r base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/" files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv") ``` -- Read files on each worker within cluster: ```r multidplyr::cluster_assign_each(clust, file_name = str_c(base_url, files)) multidplyr::cluster_send(clust, cbs_data <- vroom::vroom(file_name)) ``` -- Create a partitioned data frame spread across the cluster: ```r cbs <- multidplyr::party_df(clust, "cbs_data") ``` --- ```r cbs ``` .tiny[ ```r Source: party_df [10,277,677 x 9] Shards: 3 [3,185,906--3,757,777 rows] Duration `Start date` `End date` `Start station … `Start station` <dbl> <dttm> <dttm> <dbl> <chr> 1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A… 2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A… 3 468 2015-01-01 00:04:32 2015-01-01 00:12:20 31204 20th & E St NW 4 348 2015-01-01 00:07:18 2015-01-01 00:13:06 31602 Park Rd & Holm… 5 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &… 6 932 2015-01-01 00:10:33 
2015-01-01 00:26:06 31247 Jefferson Dr &… # … with 1.028e+07 more rows, and 4 more variables: `End station number` <dbl>, `End # station` <chr>, `Bike number` <chr>, `Member type` <chr> ``` ] <br/><br/> Now you are ready to go. --- ## Partition your data If your data already exists in memory, then you can `partition()` it across workers of a cluster. ```r cbs_full <- vroom::vroom(str_c("http://www2.stat.duke.edu/", "~sms185/data/bike/full.csv")) ``` ```r clust <- new_cluster(2) cbs_member <- cbs_full %>% group_by(`Member type`) %>% * multidplyr::partition(clust) ``` --- ```r cbs_member ``` .tiny[ ```r Source: party_df [10,277,677 x 9] Groups: Member type Shards: 2 [2,390,682--7,886,995 rows] Duration `Start date` `End date` `Start station … `Start station` <dbl> <dttm> <dttm> <dbl> <chr> 1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A… 2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A… 3 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &… 4 932 2015-01-01 00:10:33 2015-01-01 00:26:06 31247 Jefferson Dr &… 5 2646 2015-01-01 00:17:03 2015-01-01 01:01:10 31249 Jefferson Memo… 6 607 2015-01-01 00:18:20 2015-01-01 00:28:27 31104 Adams Mill & C… # … with 1.028e+07 more rows, and 4 more variables: `End station number` <dbl>, `End # station` <chr>, `Bike number` <chr>, `Member type` <chr> ``` ] --- ## Small data example ```r mtcars %>% group_by(cyl) %>% summarise(count = n()) ``` ``` #> # A tibble: 3 x 2 #> cyl count #> <dbl> <int> #> 1 4 11 #> 2 6 7 #> 3 8 14 ``` -- ```r clust <- new_cluster(3) mtcars_cyl <- mtcars %>% group_by(cyl) %>% * partition(clust) ``` --- ```r mtcars_cyl ``` ``` #> Source: party_df [32 x 11] #> Groups: cyl #> Shards: 3 [7--14 rows] #> #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 #> 3 22.8 4 141. 
95 3.92 3.15 22.9 1 0 4 2 #> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 #> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 #> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 #> # … with 26 more rows ``` -- ```r mtcars_cyl %>% summarise(count = n()) ``` ``` #> Source: party_df [3 x 2] #> Shards: 3 [1--1 rows] #> #> cyl count #> <dbl> <int> #> 1 4 11 #> 2 6 7 #> 3 8 14 ``` --- ```r mtcars_cyl %>% summarise(count = n()) %>% * collect() ``` ``` #> # A tibble: 3 x 2 #> cyl count #> <dbl> <int> #> 1 4 11 #> 2 6 7 #> 3 8 14 ``` --- ## Exercise Start with .tiny[ ```r clust <- new_cluster(3) base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/" files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv") cluster_assign_partition(clust, file_name = str_c(base_url, files)) cluster_send(clust, cbs_data <- vroom::vroom(file_name)) cbs <- party_df(clust, "cbs_data") ``` ] 1. Bring the three tibbles together with `collect()`. 2. Fix the names with `janitor::clean_names()`. 3. Add a variable `year`. 4. Partition the tibble across a cluster of three workers, grouped by `year`. Check you have `Shards: 3 [3,185,906--3,757,777 rows]`. 5. Use `lubridate::month()`, `lubridate::day()` and `lubridate::wday()` to parse `start_date` and `end_date`, and include the new variables in the tibble. 6. Compute the median ride duration for each year-month-wday combination; bring everything back together as an object named `cbs`. ??? 
## Solution .solution[ ```r library(lubridate) cbs <- cbs %>% collect() cbs <- janitor::clean_names(cbs) %>% mutate(year = str_extract(start_date, pattern = "\\d{4}")) clust <- new_cluster(3) cbs_year <- cbs %>% group_by(year) %>% partition(clust) cbs <- cbs_year %>% mutate(start_month = lubridate::month(start_date), start_day = lubridate::day(start_date), start_wday = lubridate::wday(start_date, label = TRUE), end_month = lubridate::month(end_date), end_day = lubridate::day(end_date), end_wday = lubridate::wday(end_date, label = TRUE) ) %>% group_by(year, start_month, start_wday) %>% summarise(med = median(duration)) %>% collect() ``` ] --- ## References - Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/ - https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html - https://multidplyr.tidyverse.org/