# Memory and big data
## Statistical Computing & Programming
### Shawn Santo
### 06-15-20

---

## Supplementary materials

Companion videos

- [Memory and atomic vectors](https://warpwire.duke.edu/w/kdcDAA/)
- [Memory and lists](https://warpwire.duke.edu/w/k9cDAA/)
- [Object sizes and input/output](https://warpwire.duke.edu/w/ldcDAA/)
- [Working with package `multidplyr`](https://warpwire.duke.edu/w/l9cDAA/)

Additional resources

- [Chapter 2](https://adv-r.hadley.nz/names-values.html), Advanced R by Wickham, H.
- `vroom` [vignette](https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html)
- `multidplyr` [vignette](https://multidplyr.tidyverse.org/articles/multidplyr.html)
- [Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes](https://www.sciencedirect.com/science/article/pii/S221457961630065X)
    by Schmidt, D., Chen, W., Matheson, M., & Ostrouchov, G.

---

# Memory basics

---

## Names and values

In R, a name has a value. It is not the value that has a name.

For example, in

```r
x <- c(-3, 4, 1)
```

the object named `x` is a reference to vector `c(-3, 4, 1)`.

---

We can see where this lives in memory with

```r
library(lobstr)
lobstr::obj_addr(x)
```

```
#> [1] "0x7f8be6657298"
```

and its size with

```r
lobstr::obj_size(x)
```

```
#> 80 B
```

---

## Copy-on-modify: atomic vectors

Understanding when R creates a copy of an object will allow you to write
faster code.

```r
x <- c(-3, 4, 1)
y <- x
```

```r
obj_addr(x)
```

```
#> [1] "0x7f8be65cfd28"
```

```r
obj_addr(y)
```

```
#> [1] "0x7f8be65cfd28"
```

---

```r
y[3] <- 100
```

```r
obj_addr(x)
```

```
#> [1] "0x7f8be65cfd28"
```

```r
obj_addr(y)
```

```
#> [1] "0x7f8be23d9c98"
```

---

```r
x <- c(0, 1, 9)
y <- x

obj_addr(x)
```

```
#> [1] "0x7f8be38c83a8"
```

```r
obj_addr(y)
```

```
#> [1] "0x7f8be38c83a8"
```

```r
y[4] <- -100
obj_addr(x)
```

```
#> [1] "0x7f8be38c83a8"
```

```r
obj_addr(y)
```

```
#> [1] "0x7f8be234f5c8"
```
.pull-right[
 
<center>
<img src="images/name_bind4.png">
</center>

<center>
<img src="images/name_bind5.png">
</center>
]

Even though only one element of the atomic vector `y` was changed or added,
R created a new object, as seen by the new address in memory.
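The same copy-on-modify rule applies when a vector is passed to a function. The sketch below is a minimal illustration, assuming `lobstr` is installed; `read_only()` and `overwrite_first()` are hypothetical helpers introduced only for this example.

```r
library(lobstr)

x <- c(-3, 4, 1)

# Hypothetical helpers: one only returns its argument, the other modifies it.
read_only       <- function(v) v
overwrite_first <- function(v) { v[1] <- 0; v }

same    <- read_only(x)        # no modification, so no copy is needed
changed <- overwrite_first(x)  # the write inside the function forces a copy

obj_addr(same) == obj_addr(x)     # same memory address as x
obj_addr(changed) == obj_addr(x)  # different address: a new vector was created
x                                 # the original vector is left untouched
```

A function that only reads its argument costs no extra memory, while one that writes to it pays for a copy of the whole vector.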

---

## Memory tracking

Function `tracemem()` marks an object so that a message is printed whenever the 
internal code copies the object. Let's see when `x` gets copied.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
tracemem(x)
```

```
#> [1] "<0x7f8be23dd918>"
```

```r
y <- x
```

```r
y[1] <- 0
```

```
#> tracemem[0x7f8be23dd918 -> 0x7f8be24c9188]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
```

---

```r
x
```

```
#>  [1]  0  1  1  2  3  5  8 13 21 34
```

```r
y
```

```
#>  [1]  0  1  1  2  3  5  8 13 21 34
```

```r
c(obj_addr(x), obj_addr(y))
```

```
#> [1] "0x7f8be23dd918" "0x7f8be24c9188"
```

```r
x[1] <- 0
```

```
#> tracemem[0x7f8be23dd918 -> 0x7f8be22fbb88]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
```

```r
ref(x)
```

```
#> [1:0x7f8be22fbb88] <dbl>
```

```r
ref(y)
```

```
#> [1:0x7f8be24c9188] <dbl>
```

```r
untracemem(x)
```
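The `tracemem()` messages can also be captured instead of read off the console, which is handy inside scripts. This is a minimal sketch, assuming an R build with memory profiling enabled (checked with `capabilities("profmem")`); the variable `copy_msgs` is introduced only for illustration.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- x          # two names, one vector in memory
copy_msgs <- character(0)

# tracemem() is only available when R was built with memory profiling.
if (capabilities("profmem")) {
  tracemem(x)
  # y still shares memory with x, so this write has to copy the vector;
  # the tracemem report is printed to standard output and captured here.
  copy_msgs <- capture.output(y[1] <- -1)
  untracemem(x)
  untracemem(y)
} else {
  y[1] <- -1
}

length(copy_msgs)  # number of captured lines, one per reported copy
x[1]               # x is unchanged
y[1]               # only y sees the new value
```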

---

## Copy-on-modify: lists

```r
x <- list(a = 1, b = 2, c = 3)
obj_addr(x)
```

```
#> [1] "0x7f8be2382498"
```

```r
y <- x
```

```r
c(obj_addr(x), obj_addr(y))
```

```
#> [1] "0x7f8be2382498" "0x7f8be2382498"
```

```r
ref(x, y)
```

```
#> █ [1:0x7f8be2382498] <named list> 
#> ├─a = [2:0x7f8be667acf8] <dbl> 
#> ├─b = [3:0x7f8be667acc0] <dbl> 
#> └─c = [4:0x7f8be667ac88] <dbl> 
#> 
#> [1:0x7f8be2382498]
```

---

```r
y$c <- 4
```

```r
ref(x, y)
```

```
#> █ [1:0x7f8be2382498] <named list> 
#> ├─a = [2:0x7f8be667acf8] <dbl> 
#> ├─b = [3:0x7f8be667acc0] <dbl> 
#> └─c = [4:0x7f8be667ac88] <dbl> 
#> 
#> █ [5:0x7f8be9b42818] <named list> 
#> ├─a = [2:0x7f8be667acf8] 
#> ├─b = [3:0x7f8be667acc0] 
#> └─c = [6:0x7f8be9b15790] <dbl>
```

---

```r
x <- list(a = 1, b = 2, c = 3)
y <- x
```

```r
c(obj_addr(x), obj_addr(y))
```

```
#> [1] "0x7f8be9cc3898" "0x7f8be9cc3898"
```

```r
y$d <- 9
ref(x, y)
```

```
#> █ [1:0x7f8be9cc3898] <named list> 
#> ├─a = [2:0x7f8be9c33110] <dbl> 
#> ├─b = [3:0x7f8be9c330d8] <dbl> 
#> └─c = [4:0x7f8be9c330a0] <dbl> 
#> 
#> █ [5:0x7f8be7ff0b98] <named list> 
#> ├─a = [2:0x7f8be9c33110] 
#> ├─b = [3:0x7f8be9c330d8] 
#> ├─c = [4:0x7f8be9c330a0] 
#> └─d = [6:0x7f8be9e87120] <dbl>
```

R creates a shallow copy: the list object itself is new, but its elements `a`,
`b`, and `c` still point to the same values as in `x`. Only the added element
`d` occupies new memory.
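The sharing can also be checked programmatically rather than by reading the `ref()` tree. The sketch below assumes `lobstr` is installed and uses `obj_addrs()`, which returns the address of each list element; the name `shared` is introduced only for this example.

```r
library(lobstr)

x <- list(a = 1, b = 2, c = 3)
y <- x
y$c <- 4   # modify one element of the copy

# Compare element addresses position by position.
shared <- obj_addrs(x) == obj_addrs(y)
names(shared) <- names(x)
shared                        # a and b are shared, c is not

obj_addr(x) == obj_addr(y)    # the list containers themselves differ
```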

---

## Copy-on-modify: data frames

```r
library(tidyverse)
x <- tibble(a = 1:3, b = 9:7)
```

```r
ref(x)
```

```
#> █ [1:0x7f8beadb80c8] <tibble> 
#> ├─a = [2:0x7f8bea08c9e8] <int> 
#> └─b = [3:0x7f8bea093f20] <int>
```

```r
y <- x %>% 
 mutate(b = b ^ 2)
```

```r
ref(x, y)
```

```
#> █ [1:0x7f8beadb80c8] <tibble> 
#> ├─a = [2:0x7f8bea08c9e8] <int> 
#> └─b = [3:0x7f8bea093f20] <int> 
#> 
#> █ [4:0x7f8be73ee748] <tibble> 
#> ├─a = [2:0x7f8bea08c9e8] 
#> └─b = [5:0x7f8beb2871b8] <dbl>
```

---

```r
z <- x
ref(x, z)
```

```
#> █ [1:0x7f8beadb80c8] <tibble> 
#> ├─a = [2:0x7f8bea08c9e8] <int> 
#> └─b = [3:0x7f8bea093f20] <int> 
#> 
#> [1:0x7f8beadb80c8]
```

```r
z <- x %>% 
 add_row(a = -1, b = -1)
```

```r
ref(x, z)
```

```
#> █ [1:0x7f8beadb80c8] <tibble> 
#> ├─a = [2:0x7f8bea08c9e8] <int> 
#> └─b = [3:0x7f8bea093f20] <int> 
#> 
#> █ [4:0x7f8beb311fc8] <tibble> 
#> ├─a = [5:0x7f8be9305e58] <dbl> 
#> └─b = [6:0x7f8be9305db8] <dbl>
```

If you modify a column, only that column needs to be copied in memory. However,
if you modify a row, the entire data frame is copied in memory.
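The column-versus-row rule can be verified on a base data frame as well. This is a minimal sketch, assuming `lobstr` is installed; it uses base `data.frame()` and bracket assignment rather than the `tibble` and `dplyr` calls above, and the names `d1`, `d2`, `d3` are illustrative.

```r
library(lobstr)

d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

# Modify one column: only that column should get a new address.
d2 <- d1
d2[, 2] <- d2[, 2] * 2

# Modify one row: every column is touched, so every column is copied.
d3 <- d1
d3[1, ] <- d3[1, ] * 3

obj_addrs(d1) == obj_addrs(d2)  # column x still shared, column y copied
obj_addrs(d1) == obj_addrs(d3)  # no column is shared any more
```

For wide data frames this is a reason to prefer column-wise operations over row-wise updates.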

---

## Exercise

Can you diagnose what is going on below?

```r
x <- 1:10; y <- x;

tracemem(x)
```

```
#> [1] "<0x7f8be9a6b6e8>"
```

```r
c(obj_addr(x), obj_addr(y))
```

```
#> [1] "0x7f8be9a6b6e8" "0x7f8be9a6b6e8"
```

```r
y[1] <- 3
```

```
#> tracemem[0x7f8be9a6b6e8 -> 0x7f8be92a9918]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> 
#> tracemem[0x7f8be92a9918 -> 0x7f8be9295ee8]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
```

---

## Object size

Object sizes can sometimes be deceiving.

```r
x <- rnorm(1e6)
y <- 1:1e6
z <- seq(1, 1e6, by = 1)
s <- (1:1e6) / 2
```

```r
c(obj_size(x), obj_size(y), obj_size(z), obj_size(s))
```

```
#> * 8,000,048 B
#> *       680 B
#> * 8,000,048 B
#> * 8,000,048 B
```

---

```r
c(obj_size(c(1L)), obj_size(c(1.0)))
```

```
#> * 56 B
#> * 56 B
```

```r
c(obj_size(c(1L, 2L)), obj_size(as.numeric(c(1.0, 2.0))))
```

```
#> * 56 B
#> * 64 B
```

```r
c(obj_size(c(1L, 2L, 3L)), obj_size(as.numeric(c(1.0, 2.0, 3.0))))
```

```
#> * 64 B
#> * 80 B
```

```r
c(obj_size(integer(10000)), obj_size(numeric(10000)))
```

```
#> * 40,048 B
#> * 80,048 B
```

Every vector in R carries some fixed memory overhead on top of its data. See
`?Memory` if you want to dig deeper into that overhead.
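The overhead can be made concrete with a little arithmetic. The sketch below assumes a 64-bit build of R, where the outputs above suggest a 48-byte header plus 4 bytes per integer or 8 bytes per double once a vector is large. `vector_bytes()` is a hypothetical helper, and base `object.size()` stands in for `lobstr::obj_size()`.

```r
# Hypothetical helper: expected size of a large atomic vector on 64-bit R,
# assuming a 48-byte header followed by the raw data.
vector_bytes <- function(n, bytes_per_element) {
  48 + bytes_per_element * n
}

n <- 10000

int_size <- as.numeric(object.size(integer(n)))
dbl_size <- as.numeric(object.size(numeric(n)))

c(observed = int_size, expected = vector_bytes(n, 4))  # integers: 4 bytes each
c(observed = dbl_size, expected = vector_bytes(n, 8))  # doubles:  8 bytes each
```

The formula does not hold for very short vectors, which R allocates in a few fixed size classes; that is why `numeric(3)` reports 80 B above rather than 72 B.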

---

## Exercise

Starting from 0 we can see that

```r
lobstr::obj_size(integer(0))
```

```
#> 48 B
```

```r
lobstr::obj_size(numeric(0))
```

```
#> 48 B
```

are both 48 bytes. Based on the results on the next slide, can you deduce how
R handles these numeric data in memory?

---

```r
diff(sapply(0:100, function(x) lobstr::obj_size(integer(x))))
```

```
#>   [1]  8  0  8  0 16  0  0  0 16  0  0  0 16  0  0  0 64  0  0  0  0  0  0
#>  [24]  0  0  0  0  0  0  0  0  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0
#>  [47]  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8
#>  [70]  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0  8  0
#>  [93]  8  0  8  0  8  0  8  0
```

```r
c(obj_size(integer(20)), obj_size(integer(22)))
```

```
#> * 176 B
#> * 176 B
```

```r
diff(sapply(0:100, function(x) lobstr::obj_size(numeric(x))))
```

```
#>   [1]  8  8 16  0 16  0 16  0 64  0  0  0  0  0  0  0  8  8  8  8  8  8  8
#>  [24]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
#>  [47]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
#>  [70]  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
#>  [93]  8  8  8  8  8  8  8  8
```

```r
c(obj_size(numeric(10)), obj_size(numeric(14)))
```

```
#> * 176 B
#> * 176 B
```

---

# I/O big data

---

## Getting .small[big] data into R

```r
url <- "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv"
```

```r
system.time({ d <- read.csv(url) })
```

```
   user  system elapsed 
 29.739   1.085  37.321 
```

```r
system.time({ d <- readr::read_csv(url) })
```

```
Parsed with column specification:
cols(
  Duration = col_double(),
  `Start date` = col_datetime(format = ""),
  `End date` = col_datetime(format = ""),
  `Start station number` = col_double(),
  `Start station` = col_character(),
  `End station number` = col_double(),
  `End station` = col_character(),
  `Bike number` = col_character(),
  `Member type` = col_character()
)
|================================| 100%  369 MB
   user  system elapsed 
 12.773   1.727  22.327 
```

---

```r
system.time({ d <- data.table::fread(url) })
```

```
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv'
Content type 'text/csv' length 387899567 bytes (369.9 MB)
==================================================
downloaded 369.9 MB

   user  system elapsed 
  7.363   2.009  19.942 
```

```r
system.time({ d <- vroom::vroom(url) })
```

```
Observations: 3,185,906                                                                                                                      
Variables: 9
chr  [4]: Start station, End station, Bike number, Member type
dbl  [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date

Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message

   user  system elapsed 
  5.873   2.361  18.606 
```

---

## Getting bigger data into R

```r
url <- "http://www2.stat.duke.edu/~sms185/data/bike/full.csv"
```

```r
system.time({ d <- read.csv(url) })
```

```
   user  system elapsed 
119.472   5.037 139.214 
```

```r
system.time({ d <- readr::read_csv(url) })
```

```
Parsed with column specification:
cols(
  Duration = col_double(),
  `Start date` = col_datetime(format = ""),
  `End date` = col_datetime(format = ""),
  `Start station number` = col_double(),
  `Start station` = col_character(),
  `End station number` = col_double(),
  `End station` = col_character(),
  `Bike number` = col_character(),
  `Member type` = col_character()
)
|================================| 100%  1191 MB
   user  system elapsed 
 46.845   7.607  87.425 
```

---

```r
system.time({ d <- data.table::fread(url) })
```

```
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/full.csv'
Content type 'text/csv' length 1249306730 bytes (1191.4 MB)
==================================================
downloaded 1191.4 MB

|--------------------------------------------------|
|==================================================|
   user  system elapsed 
 33.402   7.249  79.806 
```

```r
system.time({ d <- vroom::vroom(url) })
```

```
Observations: 10,277,677                                                                                                                     
Variables: 9
chr  [4]: Start station, End station, Bike number, Member type
dbl  [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date

Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
   user  system elapsed 
 18.837   6.731  57.203 
```

---

## Summary

| Function | Elapsed Time (s) |
|----------------------:|:------------:|
| `vroom::vroom()` | ~57 |
| `data.table::fread()` | ~80 |
| `readr::read_csv()` | ~87 |
| `read.csv()` | ~139 |
 
.small[
Observations: 10,277,677

Variables: 9
]
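The elapsed times above include downloading the file over the network. To compare parsing speed alone, the readers can be timed on a local file. This is a minimal sketch under stated assumptions: the synthetic data, the temporary file, and the helper `time_reader()` are all hypothetical, and `readr` and `data.table` are only timed if they happen to be installed.

```r
# Build a local CSV so that no network transfer is included in the timings.
set.seed(1)
n <- 1e5
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = seq_len(n), value = rnorm(n),
             group = sample(letters, n, replace = TRUE)),
  tmp, row.names = FALSE
)

# Hypothetical helper: elapsed seconds for one reader on one file.
time_reader <- function(reader, path) {
  unname(system.time(reader(path))["elapsed"])
}

# Base R is always available; other readers are added only if installed.
readers <- list("read.csv()" = function(p) utils::read.csv(p))
if (requireNamespace("readr", quietly = TRUE)) {
  readers[["readr::read_csv()"]] <- function(p) suppressMessages(readr::read_csv(p))
}
if (requireNamespace("data.table", quietly = TRUE)) {
  readers[["data.table::fread()"]] <- function(p) data.table::fread(p)
}

timings <- vapply(readers, time_reader, numeric(1), path = tmp)
sort(timings)
unlink(tmp)
```

`vroom::vroom()` is left out on purpose: it reads columns lazily, so a fair comparison would also need to time the first full access to the data.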

---

# Package `multidplyr`

---

## Purpose and getting started

`multidplyr` is a backend for `dplyr` that partitions a data frame across
multiple cores. It is valuable when you work with massive data and the
computation can be parallelized.

```r
devtools::install_github("tidyverse/multidplyr")
library(multidplyr)
```

Since it is a backend, you will use `dplyr` verbs (functions) as before.

"Due to the overhead associated with communicating between the nodes, 
you won’t see much performance improvement on basic dplyr verbs with less 
than ~10 million observations, and you may want to try `dtplyr`, which uses 
`data.table` instead."

*`multidplyr` requires R 3.5 or greater*

---

## Read multiple data sets

Create a cluster:

```r
clust <- multidplyr::new_cluster(3)
```

```r
base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")
```

Read files on each worker within cluster:

```r
multidplyr::cluster_assign_each(clust, 
                                file_name = str_c(base_url, files))

multidplyr::cluster_send(clust, 
 cbs_data <- vroom::vroom(file_name))
```

Create a partitioned data frame spread across the cluster:

```r
cbs <- multidplyr::party_df(clust, "cbs_data")
```
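Before working with the partitioned data frame, it can help to confirm what each worker actually holds. The sketch below assumes `multidplyr` and `dplyr` are installed; two small local files built from `mtcars` stand in for the bike-share CSVs, and the file names and the worker-side object name `part` are hypothetical.

```r
library(multidplyr)

# Two small local files stand in for the yearly bike-share CSVs.
tmp_dir <- tempfile("parts")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, c("part1.csv", "part2.csv"))  # hypothetical names
write.csv(mtcars[1:16, ],  paths[1], row.names = FALSE)
write.csv(mtcars[17:32, ], paths[2], row.names = FALSE)

clust <- new_cluster(2)

# Each worker receives one path, then reads its own file.
cluster_assign_each(clust, file_name = paths)
cluster_send(clust, part <- utils::read.csv(file_name))

# Ask every worker how many rows it holds before combining anything.
rows_per_worker <- unlist(cluster_call(clust, nrow(part)))
rows_per_worker

# Treat the per-worker objects as one partitioned data frame, then gather it.
parts <- party_df(clust, "part")
combined <- dplyr::collect(parts)
nrow(combined)
```

`cluster_call()` returns one result per worker, so a mismatch between the expected and actual row counts shows up before any expensive computation is started.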

---

```r
cbs
```

```
Source: party_df [10,277,677 x 9]
Shards: 3 [3,185,906--3,757,777 rows]

Duration `Start date` `End date` `Start station … `Start station`
 <dbl> <dttm> <dttm> <dbl> <chr> 
1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A…
2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A…
3 468 2015-01-01 00:04:32 2015-01-01 00:12:20 31204 20th & E St NW 
4 348 2015-01-01 00:07:18 2015-01-01 00:13:06 31602 Park Rd & Holm…
5 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &…
6 932 2015-01-01 00:10:33 2015-01-01 00:26:06 31247 Jefferson Dr &…
# … with 1.028e+07 more rows, and 4 more variables: `End station number` <dbl>, `End
# station` <chr>, `Bike number` <chr>, `Member type` <chr>
```

Now you are ready to go.

---

## Partition your data

If your data already exists in memory, then you can `partition()` it across 
workers of a cluster.

```r
cbs_full <- vroom::vroom(str_c("http://www2.stat.duke.edu/",
                               "~sms185/data/bike/full.csv"))
```

```r
clust <- new_cluster(2)

cbs_member <- cbs_full %>% 
 group_by(`Member type`) %>% 
 multidplyr::partition(clust)
```

---

```r
cbs_member
```

```
Source: party_df [10,277,677 x 9]
Groups: Member type
Shards: 2 [2,390,682--7,886,995 rows]

Duration `Start date` `End date` `Start station … `Start station`
 <dbl> <dttm> <dttm> <dbl> <chr> 
1 2389 2015-01-01 00:02:44 2015-01-01 00:42:33 31271 Constitution A…
2 2394 2015-01-01 00:02:46 2015-01-01 00:42:41 31271 Constitution A…
3 980 2015-01-01 00:09:39 2015-01-01 00:26:00 31247 Jefferson Dr &…
4 932 2015-01-01 00:10:33 2015-01-01 00:26:06 31247 Jefferson Dr &…
5 2646 2015-01-01 00:17:03 2015-01-01 01:01:10 31249 Jefferson Memo…
6 607 2015-01-01 00:18:20 2015-01-01 00:28:27 31104 Adams Mill & C…
# … with 1.028e+07 more rows, and 4 more variables: `End station number` <dbl>, `End
# station` <chr>, `Bike number` <chr>, `Member type` <chr>
```


---

## Small data example

```r
mtcars %>%
  group_by(cyl) %>% 
  summarise(count = n())
```

```
#> # A tibble: 3 x 2
#> cyl count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
```

```r
clust <- new_cluster(3)

mtcars_cyl <- mtcars %>% 
 group_by(cyl) %>% 
 partition(clust)
```

---

```r
mtcars_cyl
```

```
#> Source: party_df [32 x 11]
#> Groups: cyl
#> Shards: 3 [7--14 rows]
#> 
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> # … with 26 more rows
```

```r
mtcars_cyl %>% 
  summarise(count = n())
```

```
#> Source: party_df [3 x 2]
#> Shards: 3 [1--1 rows]
#> 
#> cyl count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
```

---

```r
mtcars_cyl %>% 
  summarise(count = n()) %>% 
  collect()
```

```
#> # A tibble: 3 x 2
#> cyl count
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
```

---

## Exercise

Start with

```r
clust <- new_cluster(3)

base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")

cluster_assign_partition(clust, file_name = str_c(base_url, files))
cluster_send(clust, cbs_data <- vroom::vroom(file_name))

cbs <- party_df(clust, "cbs_data")
```

1. Bring the three tibbles together with `collect()`.

2. Fix the names with `janitor::clean_names()`.

3. Add a variable `year`.

4. Partition the tibble across a cluster of three workers, grouped by `year`.
   Check you have `Shards: 3 [3,185,906--3,757,777 rows]`.

5. Use `lubridate::month()`, `lubridate::day()` and `lubridate::wday()`
   to parse `start_date` and `end_date`, and include the new variables in
   the tibble.
   
6. Compute the median ride duration for each year-month-wday combination; bring
   everything back together as an object named `cbs`.
   
???

## Solution

```r
library(lubridate)

cbs <- cbs %>% 
 collect()

cbs <- janitor::clean_names(cbs) %>% 
 mutate(year = str_extract(start_date, pattern = "\\d{4}"))

clust <- new_cluster(3)
cbs_year <- cbs %>% 
 group_by(year) %>% 
 partition(clust)

cbs <- cbs_year %>% 
 mutate(start_month = lubridate::month(start_date),
 start_day = lubridate::day(start_date),
 start_wday = lubridate::wday(start_date, label = TRUE),
 end_month = lubridate::month(end_date),
 end_day = lubridate::day(end_date),
 end_wday = lubridate::wday(end_date, label = TRUE)
 ) %>% 
 group_by(year, start_month, start_wday) %>% 
 summarise(med = median(duration)) %>% 
 collect()
```


---

## References

- Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/

- `vroom` vignette. https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html

- `multidplyr` documentation. https://multidplyr.tidyverse.org/