---
title: "Memory and big data"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-15-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```
## Supplementary materials
Companion videos
- [Memory and atomic vectors](https://warpwire.duke.edu/w/kdcDAA/)
- [Memory and lists](https://warpwire.duke.edu/w/k9cDAA/)
- [Object sizes and input/output](https://warpwire.duke.edu/w/ldcDAA/)
- [Working with package `multidplyr`](https://warpwire.duke.edu/w/l9cDAA/)
Additional resources
- [Chapter 2](https://adv-r.hadley.nz/names-values.html), Advanced R by Wickham, H.
- `vroom` [vignette](https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html)
- `multidplyr` [vignette](https://multidplyr.tidyverse.org/articles/multidplyr.html)
- [Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes](https://www.sciencedirect.com/science/article/pii/S221457961630065X)
by Schmidt, D., Chen, W., Matheson, M., & Ostrouchov, G.
---
class: inverse, center, middle
# Memory basics
---
## Names and values
In R, a name has a value. It is not the value that has a name.
For example, in
```{r}
x <- c(-3, 4, 1)
```
the name `x` is a reference to the vector `c(-3, 4, 1)`.
---
We can see where this lives in memory with
```{r}
library(lobstr)
lobstr::obj_addr(x)
```
and its size with
```{r}
lobstr::obj_size(x)
```
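---

`obj_size()` is reference-aware: a value shared by several objects is only
counted once, so sizes are not always additive. A small sketch (the exact
byte counts will vary by platform):

```{r}
library(lobstr)
a <- rnorm(1e4)
obj_size(a)            # roughly 80 kB of doubles
obj_size(list(a, a))   # barely more: both elements point to one vector
```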
---
## Copy-on-modify: atomic vectors
Understanding when R creates a copy of an object will allow you to write
faster code.
```{r}
x <- c(-3, 4, 1)
y <- x
```
--
```{r}
obj_addr(x)
obj_addr(y)
```
---
```{r}
y[3] <- 100
```
--
```{r}
obj_addr(x)
obj_addr(y)
```
---
.pull-left[
```{r}
x <- c(0, 1, 9)
y <- x
obj_addr(x)
obj_addr(y)
```
```{r}
y[4] <- -100
obj_addr(x)
obj_addr(y)
```
]
.pull-right[
]
--
Even though only a single element of the atomic vector `y` changed, R created
an entirely new object, as evidenced by the new memory address.
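---

The flip side of copy-on-modify: when a vector has only a single name bound to
it, R can typically modify it in place without a copy (the details depend on
R's reference counting, so `tracemem()` is the honest way to check). A sketch:

```{r}
v <- c(0, 1, 9)
tracemem(v)    # ask R to report any copies of v
v[1] <- 5      # sole binding: typically modified in place, no copy reported

w <- v         # second binding: v and w now share one object
w[1] <- -5     # modifying w forces a copy, which tracemem reports
untracemem(v)
```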
---
## Memory tracking
Function `tracemem()` marks an object so that a message is printed whenever the
internal code copies the object. Let's see when `x` gets copied.
```{r}
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
tracemem(x)
```
--
```{r}
y <- x
```
--
```{r}
y[1] <- 0
```
---
```{r}
x
y
c(obj_addr(x), obj_addr(y))
x[1] <- 0
ref(x)
ref(y)
untracemem(x)
```
---
## Copy-on-modify: lists
```{r}
x <- list(a = 1, b = 2, c = 3)
obj_addr(x)
```
--
```{r}
y <- x
```
--
```{r}
c(obj_addr(x), obj_addr(y))
```
--
```{r}
ref(x, y)
```
---
```{r}
y$c <- 4
```
--
```{r}
ref(x, y)
```
---
```{r}
x <- list(a = 1, b = 2, c = 3)
y <- x
```
--
```{r}
c(obj_addr(x), obj_addr(y))
```
--
```{r}
y$d <- 9
ref(x, y)
```
R creates a shallow copy: the list object and its bindings are copied, but the
values bound to elements `a`, `b`, and `c` are still shared between `x` and `y`.
---
## Copy-on-modify: data frames
```{r}
library(tidyverse)
x <- tibble(a = 1:3, b = 9:7)
```
--
```{r}
ref(x)
```
--
```{r}
y <- x %>%
  mutate(b = b ^ 2)
```
--
```{r}
ref(x, y)
```
---
```{r}
z <- x
ref(x, z)
```
--
```{r}
z <- x %>%
  add_row(a = -1, b = -1)
```
--
```{r}
ref(x, z)
```
--
If you modify a column, only that column needs to be copied in memory. However,
if you modify a row, every column, and hence the entire data frame, is copied.
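---

A sketch of that claim using a base `data.frame` and `lobstr::ref()` (this
mirrors the demonstration in chapter 2 of Advanced R):

```{r}
library(lobstr)

d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

d2 <- d1
d2[, 2] <- d2[, 2] * 2   # modify a column: only `y` is copied
ref(d1, d2)              # column `x` is still shared

d3 <- d1
d3[1, ] <- d3[1, ] * 3   # modify a row: every column is copied
ref(d1, d3)              # nothing is shared between d1 and d3
```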
---
## Exercise
Can you diagnose what is going on below?
```{r}
x <- 1:10
y <- x
tracemem(x)
c(obj_addr(x), obj_addr(y))
y[1] <- 3
```
---
## Object size
Object sizes can sometimes be deceiving.
```{r}
x <- rnorm(1e6)
y <- 1:1e6
z <- seq(1, 1e6, by = 1)
s <- (1:1e6) / 2
```
--
```{r}
c(obj_size(x), obj_size(y), obj_size(z), obj_size(s))
```
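---

The deception: since R 3.5, `:` returns a compact ALTREP sequence that stores
only its endpoints, while `seq(1, 1e6, by = 1)` materializes one million
doubles. A sketch (exact sizes vary by R version):

```{r}
library(lobstr)
obj_size(1:1e6)        # compact ALTREP sequence: a few hundred bytes
obj_size(1:1e9)        # still tiny -- the length is irrelevant
obj_size(1:1e6 + 0L)   # arithmetic materializes it: ~4 MB of integers
```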
---
```{r}
c(obj_size(c(1L)), obj_size(c(1.0)))
```
--
```{r}
c(obj_size(c(1L, 2L)), obj_size(as.numeric(c(1.0, 2.0))))
```
--
```{r}
c(obj_size(c(1L, 2L, 3L)), obj_size(as.numeric(c(1.0, 2.0, 3.0))))
```
--
```{r}
c(obj_size(integer(10000)), obj_size(numeric(10000)))
```
--
Every vector in R carries some memory overhead beyond its data. Take a look at
`?Memory` if you want to dig deeper into this overhead cost.
---
## Exercise
Starting from vectors of length 0, we can see that
```{r}
lobstr::obj_size(integer(0))
lobstr::obj_size(numeric(0))
```
are both 48 bytes. Based on the results on the next slide, can you deduce how
R allocates memory for integer and double vectors?
---
```{r}
diff(sapply(0:100, function(x) lobstr::obj_size(integer(x))))
```
```{r}
c(obj_size(integer(20)), obj_size(integer(22)))
```
```{r}
diff(sapply(0:100, function(x) lobstr::obj_size(numeric(x))))
```
```{r}
c(obj_size(numeric(10)), obj_size(numeric(14)))
```
---
class: inverse, center, middle
# I/O big data
---
## Getting .small[big] data into R
```{r}
url <- "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv"
```
.tiny[
```{r eval=FALSE}
system.time({ d <- read.csv(url) })
```
```{r eval=FALSE}
user system elapsed
29.739 1.085 37.321
```
]
--
.tiny[
```{r eval=FALSE}
system.time({ d <- readr::read_csv(url) })
```
```{r eval=FALSE}
Parsed with column specification:
cols(
Duration = col_double(),
`Start date` = col_datetime(format = ""),
`End date` = col_datetime(format = ""),
`Start station number` = col_double(),
`Start station` = col_character(),
`End station number` = col_double(),
`End station` = col_character(),
`Bike number` = col_character(),
`Member type` = col_character()
)
|================================| 100% 369 MB
user system elapsed
12.773 1.727 22.327
```
]
---
.tiny[
```{r eval=FALSE}
system.time({ d <- data.table::fread(url) })
```
```{r eval=FALSE}
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv'
Content type 'text/csv' length 387899567 bytes (369.9 MB)
==================================================
downloaded 369.9 MB
user system elapsed
7.363 2.009 19.942
```
]
--
.tiny[
```{r eval=FALSE}
system.time({ d <- vroom::vroom(url) })
```
```{r eval=FALSE}
Observations: 3,185,906
Variables: 9
chr [4]: Start station, End station, Bike number, Member type
dbl [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date
Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
user system elapsed
5.873 2.361 18.606
```
]
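---

The parsing message above can be silenced, and mis-guessed types avoided, by
passing `col_types` explicitly. A minimal sketch on a hypothetical local file
(not the bike data):

```{r eval=FALSE}
library(vroom)

# write a tiny csv to a temporary file so the example is self-contained
tf <- tempfile(fileext = ".csv")
vroom_write(data.frame(Duration = c(2389, 468),
                       `Member type` = c("Member", "Casual"),
                       check.names = FALSE),
            tf, delim = ",")

# an explicit column specification quiets the message
d <- vroom(tf, col_types = cols(
  Duration      = col_double(),
  `Member type` = col_character()
))
```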
---
## Getting bigger data into R
```{r}
url <- "http://www2.stat.duke.edu/~sms185/data/bike/full.csv"
```
.tiny[
```{r eval=FALSE}
system.time({ d <- read.csv(url) })
```
```{r eval=FALSE}
user system elapsed
119.472 5.037 139.214
```
]
--
.tiny[
```{r eval=FALSE}
system.time({ d <- readr::read_csv(url) })
```
```{r eval=FALSE}
Parsed with column specification:
cols(
Duration = col_double(),
`Start date` = col_datetime(format = ""),
`End date` = col_datetime(format = ""),
`Start station number` = col_double(),
`Start station` = col_character(),
`End station number` = col_double(),
`End station` = col_character(),
`Bike number` = col_character(),
`Member type` = col_character()
)
|================================| 100% 1191 MB
user system elapsed
46.845 7.607 87.425
```
]
---
.tiny[
```{r eval=FALSE}
system.time({ d <- data.table::fread(url) })
```
```{r eval=FALSE}
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/full.csv'
Content type 'text/csv' length 1249306730 bytes (1191.4 MB)
==================================================
downloaded 1191.4 MB
|--------------------------------------------------|
|==================================================|
user system elapsed
33.402 7.249 79.806
```
]
--
.tiny[
```{r eval=FALSE}
system.time({ d <- vroom::vroom(url) })
```
```{r eval=FALSE}
Observations: 10,277,677
Variables: 9
chr [4]: Start station, End station, Bike number, Member type
dbl [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date
Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
user system elapsed
18.837 6.731 57.203
```
]
---
## Summary
| Function | Elapsed Time (s) |
|----------------------:|:------------:|
| `vroom::vroom()` | ~57 |
| `data.table::fread()` | ~80 |
| `readr::read_csv()` | ~87 |
| `read.csv()` | ~139 |
.small[
Observations: 10,277,677
Variables: 9
]
---
class: inverse, center, middle
# Package `multidplyr`
---
## Purpose and getting started
`multidplyr` is a backend for `dplyr` that partitions a data frame across
multiple cores. It is valuable when you must work with massive data and the
computation can be parallelized.
```{r eval=FALSE}
devtools::install_github("tidyverse/multidplyr")
library(multidplyr)
```
```{r echo=FALSE}
library(multidplyr)
```
Since it is a backend, you will use the same `dplyr` verbs (functions) as
before. From the package documentation:
"Due to the overhead associated with communicating between the nodes,
you won't see much performance improvement on basic dplyr verbs with less
than ~10 million observations, and you may want to try `dtplyr`, which uses
`data.table`, instead."
*`multidplyr` requires R 3.5 or greater*
---
## Read multiple data sets
Create a cluster:
```{r eval=FALSE}
clust <- multidplyr::new_cluster(3)
```
```{r eval=FALSE}
base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")
```
--
Read a file on each worker in the cluster:
```{r eval=FALSE}
multidplyr::cluster_assign_each(clust,
                                file_name = str_c(base_url, files))
multidplyr::cluster_send(clust,
                         cbs_data <- vroom::vroom(file_name))
```
--
Create a partitioned data frame spread across the cluster:
```{r eval=FALSE}
cbs <- multidplyr::party_df(clust, "cbs_data")
```
---
```{r eval=FALSE}
cbs
```
.tiny[
```{r eval=FALSE}
Source: party_df [10,277,677 x 9]
Shards: 3 [3,185,906--3,757,777 rows]
Duration `Start date` `End date` `Start station … `Start station`
1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A…
2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A…
3 468 2015-01-01 00:04:32 2015-01-01 00:12:20 31204 20th & E St NW
4 348 2015-01-01 00:07:18 2015-01-01 00:13:06 31602 Park Rd & Holm…
5 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &…
6 932 2015-01-01 00:10:33 2015-01-01 00:26:06 31247 Jefferson Dr &…
# … with 1.028e+07 more rows, and 4 more variables: `End station number` , `End
# station` , `Bike number` , `Member type`
```
]
Now you are ready to go.
---
## Partition your data
If your data already exists in memory, then you can `partition()` it across
workers of a cluster.
```{r eval=FALSE}
cbs_full <- vroom::vroom(str_c("http://www2.stat.duke.edu/",
                               "~sms185/data/bike/full.csv"))
```
```{r eval=FALSE}
clust <- new_cluster(2)
cbs_member <- cbs_full %>%
  group_by(`Member type`) %>%
  multidplyr::partition(clust) #<<
```
---
```{r eval=FALSE}
cbs_member
```
.tiny[
```{r eval=FALSE}
Source: party_df [10,277,677 x 9]
Groups: Member type
Shards: 2 [2,390,682--7,886,995 rows]
Duration `Start date` `End date` `Start station … `Start station`
1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A…
2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A…
3 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &…
4 932 2015-01-01 00:10:33 2015-01-01 00:26:06 31247 Jefferson Dr &…
5 2646 2015-01-01 00:17:03 2015-01-01 01:01:10 31249 Jefferson Memo…
6 607 2015-01-01 00:18:20 2015-01-01 00:28:27 31104 Adams Mill & C…
# … with 1.028e+07 more rows, and 4 more variables: `End station number` , `End
# station` , `Bike number` , `Member type`
```
]
---
## Small data example
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(count = n())
```
--
```{r}
clust <- new_cluster(3)
mtcars_cyl <- mtcars %>%
  group_by(cyl) %>%
  partition(clust) #<<
```
---
```{r}
mtcars_cyl
```
--
```{r}
mtcars_cyl %>%
  summarise(count = n())
```
---
```{r}
mtcars_cyl %>%
  summarise(count = n()) %>%
  collect() #<<
```
---
## Exercise
Start with
.tiny[
```{r eval=FALSE}
clust <- new_cluster(3)
base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")
cluster_assign_partition(clust, file_name = str_c(base_url, files))
cluster_send(clust, cbs_data <- vroom::vroom(file_name))
cbs <- party_df(clust, "cbs_data")
```
]
1. Bring the three tibbles together with `collect()`.
2. Fix the names with `janitor::clean_names()`.
3. Add a variable `year`.
4. Partition the tibble, grouped by `year`, across a three-worker cluster.
Check that you have `Shards: 3 [3,185,906--3,757,777 rows]`.
5. Use `lubridate::month()`, `lubridate::day()` and `lubridate::wday()`
to parse `start_date` and `end_date`, and include the new variables in
the tibble.
6. Compute the median ride duration for each year-month-wday combination; bring
everything back together as an object named `cbs`.
???
## Solution
.solution[
```{r eval=FALSE}
library(lubridate)

cbs <- cbs %>%
  collect()

cbs <- janitor::clean_names(cbs) %>%
  mutate(year = str_extract(start_date, pattern = "\\d{4}"))

clust <- new_cluster(3)

cbs_year <- cbs %>%
  group_by(year) %>%
  partition(clust)

cbs <- cbs_year %>%
  mutate(start_month = lubridate::month(start_date),
         start_day   = lubridate::day(start_date),
         start_wday  = lubridate::wday(start_date, label = TRUE),
         end_month   = lubridate::month(end_date),
         end_day     = lubridate::day(end_date),
         end_wday    = lubridate::wday(end_date, label = TRUE)) %>%
  group_by(year, start_month, start_wday) %>%
  summarise(med = median(duration)) %>%
  collect()
```
]
---
## References
- Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/
- https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html
- https://multidplyr.tidyverse.org/