---
title: "Working with big data"
subtitle: "Programming for Statistical Science"
author: "Shawn Santo"
institute: ""
date: ""
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```
## Supplementary materials
Full video lecture available in Zoom Cloud Recordings
Additional resources
- [Chapter 2](https://adv-r.hadley.nz/names-values.html), Advanced R by Wickham, H.
- `vroom` [vignette](https://cran.r-project.org/web/packages/vroom/vignettes/vroom.html)
---
class: inverse, center, middle
# Memory basics
---
## Names and values
In R, a name has a value. It is not the value that has a name.
For example, in
```{r}
x <- c(-3, 4, 1)
```
the name `x` is a reference (binding) to the vector object `c(-3, 4, 1)`.
---
We can see where this lives in memory with
```{r}
library(lobstr)
lobstr::obj_addr(x)
```
and its size with
```{r}
lobstr::obj_size(x)
```
---
## Copy-on-modify: atomic vectors
Understanding when R creates a copy of an object will allow you to write
faster code. This is also important to keep in mind when working with very
large vectors.
```{r}
x <- c(-3, 4, 1)
y <- x
```
--
```{r}
obj_addr(x)
obj_addr(y)
```
---
```{r}
y[3] <- 100
```
--
```{r}
obj_addr(x)
obj_addr(y)
```
---
.pull-left[
```{r}
x <- c(0, 1, 9)
y <- x
obj_addr(x)
obj_addr(y)
```
```{r}
y[4] <- -100
obj_addr(x)
obj_addr(y)
```
]
.pull-right[
]
--
Even though only one component changed in the atomic vector `y`, R created
a new object as seen by the new address in memory.
???
## Copy-on-modify and loops
Poor loop implementation
.tiny[
```{r eval=FALSE}
n <- 8
x <- 1
for (i in seq_len(n)) {
  cat("Object address start iteration", i, ":", obj_addr(x), "\n")
  x <- c(x, sqrt(x[i] * i))
  cat("Object address end iteration  ", i, ":", obj_addr(x), "\n\n")
}
```
]
"Efficient" loop implementation
.tiny[
```{r eval=FALSE}
n <- 8
x <- rep(1, n + 1)
obj_addr(x)
for (i in seq_len(n)) {
  cat("Object address start iteration", i, ":", obj_addr(x), "\n")
  x[i + 1] <- sqrt(x[i] * i)
  cat("Object address end iteration  ", i, ":", obj_addr(x), "\n\n")
}
```
]
---
## Memory tracking
Function `tracemem()` marks an object so that a message is printed whenever the
internal code copies the object. Let's see when `x` gets copied.
```{r}
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
tracemem(x)
```
--
```{r}
y <- x
```
--
```{r}
y[1] <- 0
```
---
```{r}
x
y
c(obj_addr(x), obj_addr(y))
```
--
```{r}
x[1] <- 0
```
--
```{r}
lobstr::ref(x)
lobstr::ref(y)
untracemem(x)
```
---
## Copy-on-modify: lists
```{r}
x <- list(a = 1, b = 2, c = 3)
obj_addr(x)
```
--
```{r}
y <- x
```
--
```{r}
c(obj_addr(x), obj_addr(y))
```
--
```{r}
ref(x, y)
```
---
```{r}
y$c <- 4
```
--
```{r}
ref(x, y)
```
---
```{r}
x <- list(a = 1, b = 2, c = 3)
y <- x
```
--
```{r}
c(obj_addr(x), obj_addr(y))
```
--
```{r}
y$d <- 9
ref(x, y)
```
R creates a shallow copy: the list structure is copied, but elements `a`, `b`,
and `c` are still shared between `x` and `y`.
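We can verify the sharing with `obj_size()`: when several objects are passed,
memory shared between them is counted only once. A minimal sketch (assuming
`lobstr` is loaded, as on the earlier slides):

```{r}
x <- list(a = 1, b = 2, c = 3)
y <- x
y$d <- 9

obj_size(x)     # size of x alone
obj_size(x, y)  # shared elements a, b, c are counted once,
                # so this is less than obj_size(x) + obj_size(y)
```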
---
## Copy-on-modify: data frames
```{r}
library(tidyverse)
x <- tibble(a = 1:3, b = 9:7)
```
--
```{r}
ref(x)
```
--
```{r}
y <- x %>%
  mutate(b = b ^ 2)
```
--
```{r}
ref(x, y)
```
---
```{r}
z <- x
ref(x, z)
```
--
```{r}
z <- x %>%
  add_row(a = -1, b = -1)
```
--
```{r}
ref(x, z)
```
--
If you modify a column, only that column is copied in memory; the other columns
remain shared. However, if you modify a row, every column changes, so the entire
data frame must be copied.
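One way to watch this happen is with base R's `tracemem()` (a sketch; the exact
copy messages depend on your R session):

```{r eval=FALSE}
df <- data.frame(a = 1:3, b = 9:7)
tracemem(df)

df$b <- df$b ^ 2   # one copy: only column b is replaced
df[1, ] <- c(0, 0) # modifying a row touches every column,
                   # so the whole data frame is copied
untracemem(df)
```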
---
## Exercise
Can you diagnose what is going on below?
```{r}
x <- 1:10; y <- x;
tracemem(x)
c(obj_addr(x), obj_addr(y))
y[1] <- 3
```
---
## Object size
Object sizes can sometimes be deceiving.
```{r}
x <- rnorm(1e6)
y <- 1:1e6
z <- seq(1, 1e6, by = 1)
s <- (1:1e6) / 2
```
--
```{r}
c(obj_size(x), obj_size(y), obj_size(z), obj_size(s))
```
---
```{r}
c(obj_size(c(1L)), obj_size(c(1.0)))
```
--
```{r}
c(obj_size(c(1L, 2L)), obj_size(as.numeric(c(1.0, 2.0))))
```
--
```{r}
c(obj_size(c(1L, 2L, 3L)), obj_size(as.numeric(c(1.0, 2.0, 3.0))))
```
--
```{r}
c(obj_size(integer(10000)), obj_size(numeric(10000)))
```
--
There is overhead in creating vectors in R. Take a look at `?Memory` if
you want to dig deeper into the overhead cost.
---
## Exercise
Starting from length-zero vectors, we can see that
```{r}
lobstr::obj_size(integer(0))
lobstr::obj_size(numeric(0))
```
are both 48 bytes. Based on the results on the next slide, can you deduce how
R allocates memory for these vectors?
---
```{r}
diff(sapply(0:100, function(x) lobstr::obj_size(integer(x))))
```
```{r}
c(obj_size(integer(20)), obj_size(integer(22)))
```
```{r}
diff(sapply(0:100, function(x) lobstr::obj_size(numeric(x))))
```
```{r}
c(obj_size(numeric(10)), obj_size(numeric(14)))
```
---
class: inverse, center, middle
# I/O big data
---
## Getting big data into R
Dimensions: 3,185,906 x 9
```{r}
url <- "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv"
```
.tiny[
```{r eval=FALSE}
system.time({x <- read.csv(url)})
```
```{r eval=FALSE}
   user  system elapsed 
* 29.739   1.085  37.321
```
]
--
.tiny[
```{r eval=FALSE}
system.time({x <- readr::read_csv(url)})
```
```{r eval=FALSE}
Parsed with column specification:
cols(
  Duration = col_double(),
  `Start date` = col_datetime(format = ""),
  `End date` = col_datetime(format = ""),
  `Start station number` = col_double(),
  `Start station` = col_character(),
  `End station number` = col_double(),
  `End station` = col_character(),
  `Bike number` = col_character(),
  `Member type` = col_character()
)
|================================| 100% 369 MB
   user  system elapsed 
* 12.773   1.727  22.327
```
]
---
.tiny[
```{r eval=FALSE}
system.time({x <- data.table::fread(url)})
```
```{r eval=FALSE}
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/cbs_2015.csv'
Content type 'text/csv' length 387899567 bytes (369.9 MB)
==================================================
downloaded 369.9 MB
   user  system elapsed 
*  7.363   2.009  19.942
```
]
--
.tiny[
```{r eval=FALSE}
system.time({x <- vroom::vroom(url)})
```
```{r eval=FALSE}
Observations: 3,185,906
Variables: 9
chr [4]: Start station, End station, Bike number, Member type
dbl [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date
Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
   user  system elapsed 
*  5.873   2.361  18.606
```
]
---
## Getting bigger data into R
Dimensions: 10,277,677 x 9
```{r}
url <- "http://www2.stat.duke.edu/~sms185/data/bike/full.csv"
```
.tiny[
```{r eval=FALSE}
system.time({x <- read.csv(url)})
```
```{r eval=FALSE}
   user  system elapsed 
*119.472   5.037 139.214
```
]
--
.tiny[
```{r eval=FALSE}
system.time({x <- readr::read_csv(url)})
```
```{r eval=FALSE}
Parsed with column specification:
cols(
  Duration = col_double(),
  `Start date` = col_datetime(format = ""),
  `End date` = col_datetime(format = ""),
  `Start station number` = col_double(),
  `Start station` = col_character(),
  `End station number` = col_double(),
  `End station` = col_character(),
  `Bike number` = col_character(),
  `Member type` = col_character()
)
|================================| 100% 1191 MB
   user  system elapsed 
* 46.845   7.607  87.425
```
]
---
.tiny[
```{r eval=FALSE}
system.time({x <- data.table::fread(url)})
```
```{r eval=FALSE}
trying URL 'http://www2.stat.duke.edu/~sms185/data/bike/full.csv'
Content type 'text/csv' length 1249306730 bytes (1191.4 MB)
==================================================
downloaded 1191.4 MB
|--------------------------------------------------|
|==================================================|
   user  system elapsed 
* 33.402   7.249  79.806
```
]
--
.tiny[
```{r eval=FALSE}
system.time({x <- vroom::vroom(url)})
```
```{r eval=FALSE}
Observations: 10,277,677
Variables: 9
chr [4]: Start station, End station, Bike number, Member type
dbl [3]: Duration, Start station number, End station number
dttm [2]: Start date, End date
Call `spec()` for a copy-pastable column specification
Specify the column types with `col_types` to quiet this message
   user  system elapsed 
* 18.837   6.731  57.203
```
]
---
## Summary
| Function | Elapsed Time (s) |
|----------------------:|:------------:|
| `vroom::vroom()` | ~57 |
| `data.table::fread()` | ~80 |
| `readr::read_csv()` | ~87 |
| `read.csv()` | ~139 |
.small[
Observations: 10,277,677
Variables: 9
]
---
class: inverse, center, middle
# Wrangling big data
---
## Package `dtplyr`
`dtplyr` provides a `data.table` backend for `dplyr`. The goal of `dtplyr` is
to allow you to write `dplyr` code that is automatically translated to the
equivalent, but usually much faster, `data.table` code.
```{r}
library(dtplyr)
library(tidyverse)
```
Since it is a backend, you will use `dplyr` verbs (functions) as before.
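For example, `lazy_dt()` wraps a data frame, and `show_query()` reveals the
`data.table` translation of a `dplyr` pipeline. A small sketch on the built-in
`mtcars` data, not the big data used later:

```{r eval=FALSE}
library(dtplyr)
library(dplyr)

mtcars_lazy <- lazy_dt(mtcars)

mtcars_lazy %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  show_query()

# No computation happens until you call as_tibble(), as.data.frame(),
# or collect() on the lazy object.
```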
---
## Get big data
.tiny[
```{r eval=FALSE}
base_url <- "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-"
month_ext <- str_pad(1:12, width = 2, pad = "0")
urls <- str_c(base_url, month_ext, ".csv", sep = "")
taxi_2019 <- map_df(urls, vroom)
```
]
*Caution:* the full dataset is a data frame of dimension 84,399,019 x 18.
.tiny[
```{r eval=FALSE}
# A tibble: 84,399,019 x 18
VendorID tpep_pickup_dat… tpep_dropoff_da… passenger_count trip_distance RatecodeID
1 1 2019-01-01 00:4… 2019-01-01 00:5… 1 1.5 1
2 1 2019-01-01 00:5… 2019-01-01 01:1… 1 2.6 1
3 2 2018-12-21 13:4… 2018-12-21 13:5… 3 0 1
4 2 2018-11-28 15:5… 2018-11-28 15:5… 5 0 1
5 2 2018-11-28 15:5… 2018-11-28 15:5… 5 0 2
6 2 2018-11-28 16:2… 2018-11-28 16:2… 5 0 1
7 2 2018-11-28 16:2… 2018-11-28 16:3… 5 0 2
8 1 2019-01-01 00:2… 2019-01-01 00:2… 1 1.3 1
9 1 2019-01-01 00:3… 2019-01-01 00:4… 1 3.7 1
10 1 2019-01-01 00:5… 2019-01-01 01:0… 2 2.1 1
# … with 84,399,009 more rows, and 12 more variables: store_and_fwd_flag ,
# PULocationID , DOLocationID , payment_type , fare_amount , extra ,
# mta_tax , tip_amount , tolls_amount , improvement_surcharge ,
# total_amount , congestion_surcharge
```
]
---
## Time comparison
Using `dplyr`
.tiny[
```{r eval=FALSE}
system.time({
  taxi_2019 %>%
    mutate(pickup_datetime = as_datetime(tpep_pickup_datetime),
           dropoff_datetime = as_datetime(tpep_dropoff_datetime),
           pickup_month = month(pickup_datetime, label = TRUE),
           pickup_day = wday(pickup_datetime, label = TRUE)) %>%
    group_by(pickup_month, pickup_day) %>%
    summarise(mean_trip_distance = mean(trip_distance))
})

   user  system elapsed 
*339.326  21.729 444.383
```
]
--
Using `dtplyr`
.tiny[
```{r eval=FALSE}
taxi_2019_lazy <- lazy_dt(taxi_2019) #<<
system.time({
  taxi_2019_lazy %>%
    mutate(pickup_datetime = as_datetime(tpep_pickup_datetime),
           dropoff_datetime = as_datetime(tpep_dropoff_datetime),
           pickup_month = month(pickup_datetime, label = TRUE),
           pickup_day = wday(pickup_datetime, label = TRUE)) %>%
    group_by(pickup_month, pickup_day) %>%
    summarise(mean_trip_distance = mean(trip_distance)) %>%
    as_tibble() #<<
})

   user  system elapsed 
*384.199  47.111 530.458
```
]
---
## What's the point of this package?
The benefit comes when

1. you have very many groups (millions);
2. you are sorting;
3. you are doing joins or other merges with large data.

`dtplyr` will always be a little slower than `data.table`. However, this
slightly worse performance may be better than learning the syntax of
`data.table`.
---
class: inverse, center, middle
# Going forward
---
## Big data strategies
1. Avoid unnecessary copies of large objects

2. Downsample - you cannot exceed $2^{31} - 1$ rows, columns, or components
    - Downsample to visualize and use summary statistics
    - Downsample to wrangle and understand
    - Downsample to model

3. Get more RAM - this is not always easy, and sometimes not an option

4. Parallelize - this is not always an option
    - Execute a chunk and pull strategy
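As a sketch of strategy 2, assuming the `taxi_2019` data frame from the earlier
slides and `dplyr` (>= 1.0.0) for `slice_sample()`:

```{r eval=FALSE}
library(dplyr)

# Prototype the wrangling on a random subset, then run the final
# pipeline once on the full data
taxi_sample <- taxi_2019 %>%
  slice_sample(n = 100000)  # ~100k rows instead of ~84 million
```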
---
## References
1. Data Table Back-End for dplyr. (2020).
https://dtplyr.tidyverse.org/index.html.
2. Read and Write Rectangular Text Data Quickly. (2020).
https://vroom.r-lib.org/
3. Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/