Packages

library(lobstr)
library(tidyverse)
library(lubridate)
library(multidplyr)

Exercise 1

Problem

Can you diagnose what is going on below?

x <- 1:10
y <- x

tracemem(x)
#> [1] "<0x7f8c4bfcfcc0>"
c(obj_addr(x), obj_addr(y))
#> [1] "0x7f8c4bfcfcc0" "0x7f8c4bfcfcc0"
y[1] <- 3
#> tracemem[0x7f8c4bfcfcc0 -> 0x7f8c4d73f158]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> 
#> tracemem[0x7f8c4d73f158 -> 0x7f8c4dbaed28]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>

Solution

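Why are two copies made? The vector x is of type integer, and after y <- x both names point to the same object. The subassignment y[1] <- 3 sets the first component of y to 3, which is of type double, so R copies twice: once because modifying y must leave x unchanged (copy-on-modify), and once more to coerce the vector from integer to double. Assigning the integer 3L instead avoids the type change, so only one copy is made, as the trace below shows.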

x <- 1:10
y <- x

tracemem(x)
#> [1] "<0x7f8c4d1a4860>"
c(obj_addr(x), obj_addr(y))
#> [1] "0x7f8c4d1a4860" "0x7f8c4d1a4860"
y[1] <- 3L # type integer
#> tracemem[0x7f8c4d1a4860 -> 0x7f8c4d7c6a78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
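
The two effects can also be checked without reading the tracemem output. The following is a minimal sketch, assuming lobstr is installed; the name z is introduced here only for illustration. It compares object addresses and types after an integer and a double subassignment.

```r
library(lobstr)

x <- 1:10
y <- x
# Before any modification, x and y are bound to the same object
obj_addr(x) == obj_addr(y)

# Integer replacement: y is copied once, and the type stays integer
y[1] <- 3L
typeof(y)
obj_addr(x) == obj_addr(y)

# Double replacement: the vector is also coerced from integer to double
z <- x
z[1] <- 3
typeof(z)

# x itself is never modified by either assignment
identical(x, 1:10)
```

In both cases x is left untouched; only the double assignment changes the type of the modified vector.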

Exercise 2

Problem

Starting from 0 we can see that

lobstr::obj_size(integer(0))
#> 48 B
lobstr::obj_size(numeric(0))
#> 48 B

are both 48 bytes. Run the code below and see if you can deduce how R handles this numeric data in memory.

diff(sapply(0:100, function(x) obj_size(integer(x))))
c(obj_size(integer(20)), obj_size(integer(22)))
diff(sapply(0:100, function(x) obj_size(numeric(x))))
c(obj_size(numeric(10)), obj_size(numeric(14)))

Solution

R allocates memory to vectors in chunks. An integer vector of length one is allocated 56 bytes, 8 more than an empty integer vector. Since an integer component requires only 4 bytes, an integer vector of length two fits in the same 56 bytes, and R does not need to request any more memory. Hence, obj_size(integer(1)) and obj_size(integer(2)) are the same. The diff() calls show where the size jumps, which gives you an idea of how memory is allocated in chunks.
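
One way to make the chunks visible is to collapse the sizes into runs of equal values. This is a sketch only, assuming lobstr is installed; the names sizes, runs and chunks are illustrative, and the exact byte counts may depend on the R build.

```r
library(lobstr)

# Size in bytes of integer vectors of length 0, 1, ..., 32
sizes <- vapply(0:32, function(n) as.numeric(obj_size(integer(n))), numeric(1))

# Consecutive lengths that share a size belong to the same allocation chunk
runs <- rle(sizes)
chunks <- data.frame(
  first_length = cumsum(c(0, head(runs$lengths, -1))),  # shortest vector in the run
  n_lengths    = runs$lengths,                          # how many lengths share the size
  bytes        = runs$values                            # total object size
)
chunks
```

Each row of chunks is one allocation step: every vector length in that run occupies the same number of bytes.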

Exercise 3

Problem

Start with the code below to create a partitioned data frame spread across a cluster of three workers.

clust <- new_cluster(3)

base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")

cluster_assign_partition(clust, file_name = str_c(base_url, files))
cluster_send(clust, cbs_data <- vroom::vroom(file_name))

cbs <- party_df(clust, "cbs_data")

  1. Bring the three tibbles together with collect().

  2. Fix the names with janitor::clean_names().

  3. Add a variable year.

  4. Partition the tibble across three workers, grouped by year. Check that you have Shards: 3 [3,185,906--3,757,777 rows].

  5. Use lubridate::month(), lubridate::day() and lubridate::wday() to parse start_date and end_date, and include the new variables in the tibble.

  6. Compute the median ride duration for each year-month-wday combination; bring everything back together as an object named cbs.

Solution

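# 1. Bring the three tibbles back to the main session
cbs <- cbs %>% 
  collect()

# 2-3. Clean the column names and extract the year from start_date
cbs <- janitor::clean_names(cbs) %>% 
  mutate(year = str_extract(start_date, pattern = "\\d{4}"))

# 4. Partition by year across a new cluster of three workers
clust <- new_cluster(3)
cbs_year <- cbs %>% 
  group_by(year) %>% 
  partition(clust)

# 5-6. Add the date components on each worker, compute the median
#      duration per year-month-wday, and collect the result as cbs
cbs <- cbs_year %>% 
  mutate(start_month = lubridate::month(start_date),
         start_day   = lubridate::day(start_date),
         start_wday  = lubridate::wday(start_date, label = TRUE),
         end_month   = lubridate::month(end_date),
         end_day     = lubridate::day(end_date),
         end_wday    = lubridate::wday(end_date, label = TRUE)
         ) %>% 
  group_by(year, start_month, start_wday) %>% 
  summarise(med = median(duration)) %>% 
  collect()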
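
The partition, summarise and collect pattern can be tried on a small scale before downloading the full data. The sketch below assumes dplyr, multidplyr and lubridate are installed; the tibble toy and its four rows are made up for illustration, and the year is taken with lubridate::year() rather than str_extract() purely for brevity.

```r
library(dplyr)
library(multidplyr)

# Hypothetical toy data: four rides, all starting on a Monday
toy <- tibble(
  start_date = as.POSIXct(
    c("2015-01-05 08:00:00", "2015-01-12 09:00:00",
      "2016-02-01 10:00:00", "2016-02-08 11:00:00"),
    tz = "UTC"
  ),
  duration = c(600, 900, 300, 500)
) %>%
  mutate(year = lubridate::year(start_date))

# Two workers are enough here: one shard per year
clust <- new_cluster(2)

toy_summary <- toy %>%
  group_by(year) %>%
  partition(clust) %>%                       # send each year to a worker
  mutate(start_month = lubridate::month(start_date),
         start_wday  = lubridate::wday(start_date, label = TRUE)) %>%
  group_by(year, start_month, start_wday) %>%
  summarise(med = median(duration)) %>%      # computed on the workers
  collect() %>%                              # bring the results back
  arrange(year)

toy_summary
```

Because the lubridate functions are called with the lubridate:: prefix, the workers do not need the package attached, only installed.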