library(lobstr)
library(tidyverse)
library(lubridate)
library(multidplyr)
Can you diagnose what is going on below?
x <- 1:10
y <- x
tracemem(x)
#> [1] "<0x7f8c4bfcfcc0>"
c(obj_addr(x), obj_addr(y))
#> [1] "0x7f8c4bfcfcc0" "0x7f8c4bfcfcc0"
y[1] <- 3
#> tracemem[0x7f8c4bfcfcc0 -> 0x7f8c4d73f158]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
#> tracemem[0x7f8c4d73f158 -> 0x7f8c4dbaed28]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
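The question is: why are two copies made? The vector x is of type integer, and y initially points to the same object. When we use subassignment to change the first component of y to 3 (of type double), R copies twice: once because the object is shared with x and must be copied before it is modified, and once to coerce the vector from integer to double. If we assign the integer 3L instead, no type change is needed and only one copy is made.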
x <- 1:10
y <- x
tracemem(x)
#> [1] "<0x7f8c4d1a4860>"
c(obj_addr(x), obj_addr(y))
#> [1] "0x7f8c4d1a4860" "0x7f8c4d1a4860"
y[1] <- 3L # type integer
#> tracemem[0x7f8c4d1a4860 -> 0x7f8c4d7c6a78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
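The type-change explanation can also be checked without reading tracemem output. The sketch below is only an illustration (the names z, types_after_double and types_after_integer are not part of the exercise): it compares the types and addresses of the vectors after a double and an integer subassignment.

```r
library(lobstr)

x <- 1:10
y <- x                      # y and x point to the same object

y[1] <- 3                   # 3 is a double, so y is coerced to double
types_after_double <- c(x = typeof(x), y = typeof(y))

z <- x                      # start again from the shared integer vector
z[1] <- 3L                  # 3L is an integer, so no coercion is needed
types_after_integer <- c(x = typeof(x), z = typeof(z))

types_after_double          # x stays "integer", y becomes "double"
types_after_integer         # both remain "integer"

# In both cases the modified vector no longer shares memory with x
obj_addr(x) == obj_addr(y)
obj_addr(x) == obj_addr(z)
```

In both cases x is left untouched, which is the point of copy-on-modify; only the double assignment changes the type of the modified copy.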
Starting from length 0, we can see that
lobstr::obj_size(integer(0))
#> 48 B
lobstr::obj_size(numeric(0))
#> 48 B
are both 48 bytes. Run the code below and see if you can deduce how R allocates memory for integer and double vectors.
diff(sapply(0:100, function(x) obj_size(integer(x))))
c(obj_size(integer(20)), obj_size(integer(22)))
diff(sapply(0:100, function(x) obj_size(numeric(x))))
c(obj_size(numeric(10)), obj_size(numeric(14)))
R allocates memory to vectors in chunks. An integer vector of length one is allocated 56 bytes, 8 more than an empty integer vector. Since an integer component requires only 4 bytes, an integer vector of length two fits in the same 56 bytes; R does not need any more memory. Hence, obj_size(integer(1)) and obj_size(integer(2)) are the same. The diff() calls show where the allocated size jumps as the length grows.
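To make the chunking easier to see, the sizes can be put side by side in a small table. This is a minimal sketch; the names lens, int_sizes, dbl_sizes and sizes are chosen here for illustration, and the exact byte counts may depend on the platform and R build.

```r
library(lobstr)

lens <- 0:20

# Object size in bytes for integer and double vectors of each length
int_sizes <- sapply(lens, function(n) as.numeric(obj_size(integer(n))))
dbl_sizes <- sapply(lens, function(n) as.numeric(obj_size(numeric(n))))

sizes <- data.frame(length = lens,
                    integer_bytes = int_sizes,
                    double_bytes = dbl_sizes)
print(sizes)

# Lengths at which the allocated size jumps to the next chunk
int_jumps <- lens[-1][diff(int_sizes) > 0]
dbl_jumps <- lens[-1][diff(dbl_sizes) > 0]
int_jumps
dbl_jumps
```

Because a double takes 8 bytes and an integer 4, the double column should reach each new chunk at roughly half the length that the integer column does.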
Start with the code below to create a partitioned data frame spread across a cluster of three workers.
clust <- new_cluster(3)
base_url <- "http://www2.stat.duke.edu/~sms185/data/bike/"
files <- c("cbs_2015.csv", "cbs_2016.csv", "cbs_2017.csv")
cluster_assign_partition(clust, file_name = str_c(base_url, files))
cluster_send(clust, cbs_data <- vroom::vroom(file_name))
cbs <- party_df(clust, "cbs_data")
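To see what cluster_assign_partition() does before reading any files, it may help to try it on a tiny vector. The sketch below is an illustration only: toy_clust, part and the file names are hypothetical and nothing is downloaded. It assumes multidplyr can start two local worker processes.

```r
library(multidplyr)

# A small two-worker cluster, used only for illustration
toy_clust <- new_cluster(2)

# Split a hypothetical character vector across the workers: each worker
# receives its own piece under the name `part`
cluster_assign_partition(toy_clust, part = c("file_a.csv", "file_b.csv"))

# Ask every worker what it holds; the result is a list with one entry per worker
parts <- cluster_call(toy_clust, part)
parts
```

In the exercise code the same mechanism hands each worker one URL as file_name, so the cluster_send() call makes every worker read only its own file.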
Bring the three tibbles together with collect().
Fix the names with janitor::clean_names().
Add a variable year.
Partition the tibble onto three workers, grouped by year. Check that you have Shards: 3 [3,185,906--3,757,777 rows].
Use lubridate::month(), lubridate::day() and lubridate::wday() to parse start_date and end_date, and include the new variables in the tibble.
Compute the median ride duration for each year-month-wday combination; bring everything back together as an object named cbs.
# Pull the three partitions back into a single local tibble
cbs <- cbs %>%
  collect()

# Standardise the column names, then take the four-digit year from start_date
cbs <- janitor::clean_names(cbs) %>%
  mutate(year = str_extract(start_date, pattern = "\\d{4}"))

# Start a fresh three-worker cluster and send each year to its own worker
clust <- new_cluster(3)
cbs_year <- cbs %>%
  group_by(year) %>%
  partition(clust)

# Derive the date parts on the workers, summarise there, and collect the result
cbs <- cbs_year %>%
  mutate(start_month = lubridate::month(start_date),
         start_day   = lubridate::day(start_date),
         start_wday  = lubridate::wday(start_date, label = TRUE),
         end_month   = lubridate::month(end_date),
         end_day     = lubridate::day(end_date),
         end_wday    = lubridate::wday(end_date, label = TRUE)
  ) %>%
  group_by(year, start_month, start_wday) %>%
  summarise(med = median(duration)) %>%
  collect()
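Before running the pipeline on millions of rows, the grouping and summary logic can be sanity-checked on a toy tibble with plain dplyr and no cluster. This is a sketch under stated assumptions: toy and toy_summary are made-up names, the four rides are invented data, and year is taken with lubridate::year() rather than the str_extract() call used above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for the bike-share rides
toy <- tibble(
  start_date = ymd_hms(c("2015-01-05 08:00:00", "2015-01-12 09:30:00",
                         "2016-03-02 17:15:00", "2016-03-09 18:45:00")),
  duration   = c(600, 900, 300, 500)
)

# Same grouping and summary as the partitioned pipeline, run locally
toy_summary <- toy %>%
  mutate(year        = year(start_date),
         start_month = month(start_date),
         start_wday  = wday(start_date, label = TRUE)) %>%
  group_by(year, start_month, start_wday) %>%
  summarise(med = median(duration), .groups = "drop")

toy_summary
```

Each year here contains two rides on the same weekday of the same month, so the summary has one row per year with the median of the two durations.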