library(tidyverse)
library(nycflights13)
library(disk.frame)
setup_disk.frame(workers = 4)
# Allow an unlimited amount of data to be
# passed between workers
options(future.globals.maxSize = Inf)
disk.frame
flights_disk <- as.disk.frame(df = flights, outdir = "tmp_flights.df",
                              nchunks = 10, overwrite = TRUE)
Use flights_disk to compute the mean, median, and IQR of departure delay for each carrier. Arrange the carriers alphabetically. Compare your result to the one you get using flights.
Using flights_disk
flights_disk %>%
  select(carrier, dep_delay) %>%
  group_by(carrier) %>%
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    iqr_dep_delay = IQR(dep_delay, na.rm = TRUE)
  ) %>%
  collect() %>%
  arrange(carrier)
Using flights
flights %>%
  select(carrier, dep_delay) %>%
  group_by(carrier) %>%
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    median_dep_delay = median(dep_delay, na.rm = TRUE),
    iqr_dep_delay = IQR(dep_delay, na.rm = TRUE)
  ) %>%
  collect() %>%  # no-op for an in-memory tibble; kept to mirror the pipeline above
  arrange(carrier)
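When comparing the two tables, the means should agree, but the medians and IQRs may not. A mean can be rebuilt exactly from per-chunk sums and counts, whereas a median or IQR cannot be rebuilt from per-chunk summaries, so a chunked computation can only estimate them. The sketch below illustrates the idea in base R with three made-up chunks. The median-of-chunk-medians rule is an assumption used for illustration and is not necessarily what disk.frame does internally.

```r
# Three hypothetical chunks of one variable (made-up numbers)
chunks <- list(c(1, 2, 30), c(3, 4, 5), c(6, 7, 8))
all_values <- unlist(chunks)

# Mean: exact when rebuilt from per-chunk sums and counts
chunked_mean <- sum(sapply(chunks, sum)) / sum(sapply(chunks, length))
global_mean <- mean(all_values)

# Median: combining per-chunk medians (assumed rule) is only an estimate
chunked_median <- median(sapply(chunks, median))
global_median <- median(all_values)

all.equal(chunked_mean, global_mean)                  # TRUE
c(chunked = chunked_median, global = global_median)   # 4 vs 5
```

Here the chunked and global means coincide, while the chunked median (4) misses the true median (5).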
Run the following code. How do you think the sampling is being done?
flights_disk %>%
  sample_frac(size = .01) %>%
  collect() %>%
  as_tibble()
The sampling is done chunk by chunk: each chunk is sampled independently and the pieces are then combined. As a result, you may not get exactly 1% of the total rows in flights.
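A small base R sketch of why the row count can drift. It assumes that each chunk contributes round(size * rows_in_chunk) rows and that the 336,776 rows of flights are split as evenly as possible across the 10 chunks. Both are assumptions made for illustration, and the actual chunk sizes on disk may differ.

```r
# Hypothetical chunk sizes: 336,776 rows split as evenly as possible into 10 chunks
chunk_rows <- c(rep(33678, 6), rep(33677, 4))
total_rows <- sum(chunk_rows)            # 336776

# Assumed rule: each chunk is sampled separately and its size is rounded
per_chunk <- round(0.01 * chunk_rows)

sum(per_chunk)                           # 3370 rows from chunk-wise sampling
round(0.01 * total_rows)                 # 3368 rows from sampling the whole table
```

Under these assumptions the rounding error accumulates across chunks, so the chunk-wise sample ends up two rows larger than a 1% sample of the full table.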
On the server, copy the Capital Bikeshare datasets to your home directory with
cp -rf cbs_data/ ~/
Create a disk.frame object using all the CSV files. Check how many rows and variables you have. Finally, create a visualization showing the mean ride duration for each station by member type, but show only the 10 stations with the longest average duration.
Set up the disk.frame with
file_list <- list.files(path = "cbs_data/", full.names = TRUE)
bikes_disk <- csv_to_disk.frame(file_list, outdir = "tmp_bike.df")
Aggregate data for visualization
bikes_disk %>%
  select(duration = Duration, start_station = `Start station`,
         member = `Member type`) %>%
  group_by(start_station, member) %>%
  summarise(mean_duration = mean(duration, na.rm = TRUE)) %>%
  collect()