Packages

library(tidyverse)
library(nycflights13)
library(disk.frame)
setup_disk.frame(workers = 4)

# Lift the size limit on objects (globals) that
# future passes to the workers
options(future.globals.maxSize = Inf)

Create a disk.frame

flights_disk <- as.disk.frame(df = flights, outdir = "tmp_flights.df",
                              nchunks = 10, overwrite = TRUE)

Exercise 1

Problem

Use flights_disk to compute the mean, median, and IQR of departure delay for each carrier. Arrange the carriers alphabetically. Compare your results with those obtained from the in-memory flights data.

Solution

Using flights_disk

flights_disk %>% 
  select(carrier, dep_delay) %>% 
  group_by(carrier) %>% 
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    iqr_dep_delay = IQR(dep_delay, na.rm = TRUE)
  ) %>% 
  collect() %>% 
  arrange(carrier)

Using flights

flights %>% 
  select(carrier, dep_delay) %>% 
  group_by(carrier) %>% 
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    iqr_dep_delay = IQR(dep_delay, na.rm = TRUE)
  ) %>% 
  arrange(carrier)
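
When comparing the two results, it can help to remember that some statistics combine cleanly across chunks and others do not. The base R sketch below uses made-up numbers and hypothetical names (x, chunks, and so on). It assumes a chunked backend that summarises each chunk and then combines the per-chunk summaries. Whether disk.frame computes median and IQR this way depends on the version, so treat this as an illustration and check the package documentation.

```r
# Toy data split into three equal-sized "chunks" (values are made up)
x <- c(1, 2, 3, 4, 100, 200, 300, 5, 6)
chunks <- split(x, rep(1:3, each = 3))

# Mean: per-chunk sums and counts combine to the exact overall mean
chunk_sums <- vapply(chunks, sum, numeric(1))
chunk_ns <- vapply(chunks, length, integer(1))
combined_mean <- sum(chunk_sums) / sum(chunk_ns)
exact_mean <- mean(x)

# Median: the median of per-chunk medians is only an approximation
chunk_medians <- vapply(chunks, median, numeric(1))
median_of_medians <- median(chunk_medians)
exact_median <- median(x)

c(combined_mean = combined_mean, exact_mean = exact_mean)
c(median_of_medians = median_of_medians, exact_median = exact_median)
```

If the means agree between flights_disk and flights but the medians or IQRs differ slightly, a chunk-wise approximation of this kind is one possible explanation.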

Exercise 2

Problem

Run the following code. How do you think the sampling is being done?

flights_disk %>% 
  sample_frac(size = .01) %>% 
  collect() %>% 
  as_tibble()

Solution

The sampling is done chunk by chunk: each chunk is sampled independently at the requested fraction. As a result, the total number of rows returned may not be exactly 1% of the rows in flights.
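
The effect of chunk-wise sampling on the row count can be sketched in base R. The example below is a toy stand-in, not disk.frame itself: the data frame toy, the chunk sizes, and the per-chunk rounding rule are all assumptions made for illustration.

```r
set.seed(42)

# Toy stand-in for a disk.frame: 600 rows split into uneven "chunks"
toy <- data.frame(id = 1:600)
chunk_id <- rep(1:4, times = c(149, 149, 149, 153))  # hypothetical chunk sizes
chunks <- split(toy, chunk_id)

frac <- 0.01

# Sample each chunk independently, rounding the sample size per chunk
# (assumed rounding rule; the real implementation may differ)
sampled <- lapply(chunks, function(chunk) {
  n_take <- round(nrow(chunk) * frac)
  chunk[sample.int(nrow(chunk), n_take), , drop = FALSE]
})

chunkwise_n <- sum(vapply(sampled, nrow, integer(1)))
global_n <- round(nrow(toy) * frac)

c(chunkwise_n = chunkwise_n, global_n = global_n)
```

Because the rounding happens once per chunk instead of once overall, the chunk-wise total can drift away from the size of a single sample drawn from the whole table.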

Exercise 3

Problem

On the server, copy the Capital Bikeshare datasets to your home directory with

cp -rf cbs_data/ ~/

Create a disk.frame object using all the CSV files. Check how many rows and variables you have. Finally, create a visualization showing the mean ride duration for each start station by member type. Show only the 10 stations with the longest average duration.

Solution

Set up the disk.frame with

file_list <- list.files(path = "cbs_data/", full.names = TRUE)
bikes_disk <- csv_to_disk.frame(file_list, outdir = "tmp_bike.df")

# Check the number of rows and variables
nrow(bikes_disk)
ncol(bikes_disk)

Aggregate data for visualization

bikes_disk %>% 
  select(duration = Duration, start_station = `Start station`,
         member = `Member type`) %>% 
  group_by(start_station, member) %>% 
  summarise(mean_duration = mean(duration, na.rm = TRUE)) %>% 
  collect()
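
The aggregation above returns one row per station and member type. One possible way to finish the exercise is sketched below. It runs on a toy stand-in for the collected result, since the real data lives on the server. The objects station_means, top_stations, and plot_data are hypothetical names, the durations are randomly generated, and ranking stations by their largest group mean is an assumption about what "longest average" means.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Toy stand-in for the collected summary (made-up stations and values)
station_means <- expand.grid(
  start_station = paste("Station", 1:15),
  member = c("Casual", "Member"),
  stringsAsFactors = FALSE
) %>%
  mutate(mean_duration = runif(n(), min = 300, max = 3000))

# Assumption: rank stations by their largest mean across member types
top_stations <- station_means %>%
  group_by(start_station) %>%
  summarise(longest = max(mean_duration)) %>%
  slice_max(longest, n = 10, with_ties = FALSE)

# Keep both member types for the 10 selected stations
plot_data <- station_means %>%
  semi_join(top_stations, by = "start_station")

# Dodged horizontal bars: one bar per member type within each station
ggplot(plot_data, aes(x = mean_duration,
                      y = reorder(start_station, mean_duration),
                      fill = member)) +
  geom_col(position = "dodge") +
  labs(x = "Mean duration", y = "Start station", fill = "Member type")
```

With the real data, station_means would be replaced by the collected output of the aggregation step. Doing the ranking after collect() keeps the top-10 selection in memory, where the summary table is small.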