March 30, 2017

Setup & Configuration

Datasets, presentation, source code

  • The slides, data, and all source files for this presentation can be found at

    • www.stat.duke.edu/~cr173/datafest/df2017/


  • The data we will be using today is available via the ASA's 2009 Data Expo. The original data set contains information on all domestic flights in the US between 1987 and 2008; we will limit ourselves to 2008 to keep the data size manageable.

Packages

The following are packages we will be using this evening:

  • dplyr, ggplot2

  • readr, data.table

  • pryr

  • microbenchmark, rbenchmark, profvis

Generally whenever I use a package I will load it before the relevant code.

Getting data into R

Know your data (size)

file.info(list.files("data",full.names = TRUE))
##                        size isdir mode               mtime               ctime
## data/2008.csv     689413344 FALSE  644 2017-03-30 11:50:49 2017-03-30 11:50:49
## data/2008.csv.bz2 113753229 FALSE  644 2017-03-30 11:38:00 2017-03-30 11:38:04
##                                 atime uid gid  uname grname
## data/2008.csv     2017-03-30 16:56:58 502  20 rundel  staff
## data/2008.csv.bz2 2017-03-30 13:53:37 502  20 rundel  staff

Reading CSV - Base R

system.time({flights = read.csv("data/2008.csv")})

##   user  system elapsed 
## 89.693   4.299  95.563 
system.time({flights = read.csv("data/2008.csv.bz2")})

##    user  system elapsed 
## 179.085   4.487 186.402 
system.time({flights = read.csv("data/2008.csv", comment.char="", stringsAsFactors=FALSE)})

##   user  system elapsed 
## 85.795   3.545  90.966 

Reading CSV - Fancy R

library(readr)
system.time({flights = read_csv("data/2008.csv.bz2")})

## |==================================================================================| 100%  657 MB
##    user  system elapsed 
##  56.820  13.262  74.163 
library(data.table)
system.time({flights = fread("data/2008.csv", showProgress = FALSE)})
##    user  system elapsed 
##   7.359   0.854   9.755

Aside - cleaning up fread

Some people like data.table, others do not. If you don't intend to use data.table functionality after reading with fread, it is a good idea to fix the class of your object after reading it in (a couple of options are shown below).

class(flights)
## [1] "data.table" "data.frame"
class(flights) = "data.frame"

class(flights)
## [1] "data.frame"
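
If you would rather not overwrite the class attribute by hand, data.table's setDF() (which converts in place) and base R's as.data.frame() get you to the same place; a minimal sketch:

# convert the data.table to a plain data.frame in place (no copy)
data.table::setDF(flights)

# ... or, equivalently, make a conventional copy with base R
# flights = as.data.frame(flights)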

Aside - Saving R objects

A lot of the difficulty and slowness of reading CSV files comes from trying to figure out what type each column should have. We can save ourselves this trouble later on by saving our data as a binary Rdata file.

save(flights, file="data/flights.Rdata")
system.time(load(file="data/flights.Rdata"))

##   user  system elapsed 
## 10.563   0.399  11.027
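
A closely related base R option (not benchmarked here) is saveRDS / readRDS, which stores a single object and lets you choose its name when reading it back:

# save a single object ...
saveRDS(flights, file = "data/flights.rds")

# ... and read it back later under whatever name you like
flights = readRDS("data/flights.rds")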

A foray into algorithms

Numbers every programmer / everyone should know

Task                                              Timing
L1 cache reference                                   0.5 ns
L2 cache reference                                     7 ns
Main memory reference                                100 ns
Read 1 MB sequentially from memory               250,000 ns
Read 1 MB sequentially from SSD                1,000,000 ns
Disk seek                                     10,000,000 ns
Read 1 MB sequentially from network           10,000,000 ns
Read 1 MB sequentially from spinning disk     30,000,000 ns
Send packet CA -> Netherlands -> CA          150,000,000 ns

From: https://gist.github.com/jboner/2841832

Implications for big-ish data

Let's imagine we have a 10 GB flat data file and we want to select certain rows based on a given criterion. This requires a sequential read across the entire data set.

If we can store the file in memory:

  • \(10~GB \times (250~\mu s / 1~MB) = 2.5\) seconds

If we have to access the file from an SSD:

  • \(10~GB \times (1~ms / 1~MB) = 10\) seconds

If we have to access the file from a spinning disk:

  • \(10~GB \times (30~ms / 1~MB) = 300\) seconds


This is just for reading the data; if we make any modifications (writing), things get much worse. It is also incredibly optimistic, since it assumes all of the data on disk can be read in one continuous pass.
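
The arithmetic behind these estimates is easy to reproduce in R; a quick sketch using the per-MB read costs from the table above:

file_mb   = 10 * 1000                              # 10 GB of data, in MB
ms_per_mb = c(memory = 0.25, ssd = 1, disk = 30)   # ms to read 1 MB, from the table

file_mb * ms_per_mb / 1000                         # seconds: 2.5 (memory), 10 (SSD), 300 (disk)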

Moral of the story

Practical Advice

Databases (e.g. MySQL, Oracle, etc.) were developed to solve exactly this problem: what do you do when data size >> available memory?


  • Generally speaking they are a great tool but a bit of overkill for DataFest


  • DataFest data is generally on the scale of 1-2 GB

    • Small enough to fit in memory on most laptops

    • Large enough that you need to be careful (many operations will create hidden copies)

Sampling FTW

A simple but useful trick is to only work with the full data set when you absolutely have to. We are usually interested in the large-scale patterns / features of the data, and these are generally preserved under random sampling (this is what keeps statisticians employed).

library(dplyr)
library(pryr)

set.seed(20170330)
flights_sub = flights %>% sample_frac(0.1)
object.size(flights)
## 953617184 bytes
object_size(flights)
## 954 MB
object.size(flights_sub)
## 98427752 bytes
object_size(flights_sub)
## 98.4 MB
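
Dropping columns you do not need helps as well; a minimal sketch using dplyr's select (the column names below are assumed from the 2008 flights data, keep whichever ones your analysis actually uses):

# keep only the columns of interest, then sample
flights_sub = flights %>%
  select(Month, DayOfWeek, UniqueCarrier, AirTime, ArrDelay, Distance) %>%
  sample_frac(0.1)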

Aside - Deleting Objects

If you are no longer using the original object, it is a good idea to delete it to free up memory (you can always read it back in later).

rm(flights)
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  1062473  56.8    1770749   94.6   1770749   94.6
## Vcells 14146350 108.0  258829281 1974.8 577295819 4404.5

Other useful tools

Progress bars (via dplyr)

Generally we want to avoid for loops, but that isn't always possible. For slow loops it is a good idea to track our progress (and know how much time is left).

library(dplyr)

p = progress_estimated(50, min_time = 0)
for(i in 1:50)
{
  # Calculate something complicated
  Sys.sleep(0.1)
  
  p$tick()$print()
}

Simple benchmarking

The simplest tool is base R's system.time, which can be used to wrap any other call or expression.

system.time(rnorm(1e6))
##    user  system elapsed 
##   0.095   0.002   0.106
system.time(rnorm(1e4) %*% t(rnorm(1e4)))
##    user  system elapsed 
##   0.566   0.321   1.001

Better benchmarking - microbenchmark

We can do better (higher precision) using the microbenchmark package.

library(microbenchmark)

d = abs(rnorm(1000))
r = microbenchmark(
      exp(log(d)/2),
      d^0.5,
      sqrt(d),
      times = 1000
    )
print(r)
## Unit: microseconds
##           expr    min     lq      mean  median     uq      max neval
##  exp(log(d)/2) 14.848 15.439 20.004378 17.9640 19.167  219.591  1000
##          d^0.5 24.195 24.582 32.358537 27.2815 28.820 2003.781  1000
##        sqrt(d)  2.337  2.775  5.910073  5.1995  5.718  894.631  1000

boxplot(r)

Better benchmarking - rbenchmark

We can also use the rbenchmark package

library(rbenchmark)

d = abs(rnorm(1000))
benchmark(
  exp(log(d)/2),
  d^0.5,
  sqrt(d),
  replications = 1000,
  order = "relative"
)
##            test replications elapsed relative user.self sys.self user.child sys.child
## 3       sqrt(d)         1000   0.004     1.00     0.004    0.000          0         0
## 1 exp(log(d)/2)         1000   0.025     6.25     0.023    0.000          0         0
## 2         d^0.5         1000   0.029     7.25     0.029    0.001          0         0

Profiling

library(profvis)

set.seed(20170330)
flights_small = flights %>% sample_n(100000)
profvis({
  m = lm(AirTime ~ Distance, data = flights_small)
  plot(AirTime ~ Distance, data = flights_small)
  abline(m, col = "red")
})

General Advice and
Pearls of Wisdom

General Advice

  • Sample, sample, sample

    • and also maybe subset (rows and columns)
  • Save your code and important data objects regularly - bigger data can and will kill your R session

  • If something is taking a long time - kill it and figure out why (or at least how long it will take)

  • Vectorized >> *apply / map* >> for / while (see the sketch below)
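
To put some numbers behind that last point, here is a small sketch using microbenchmark (introduced earlier); exact timings vary by machine, but the ordering vectorized < *apply < explicit loop is typical:

library(microbenchmark)

x = abs(rnorm(1e5))

# explicit loop version of sqrt, for comparison
sqrt_loop = function(x) {
  out = numeric(length(x))
  for(i in seq_along(x)) out[i] = sqrt(x[i])
  out
}

microbenchmark(
  vectorized = sqrt(x),
  apply      = sapply(x, sqrt),
  loop       = sqrt_loop(x),
  times      = 10
)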

What's happening?

Make use of Activity Monitor (OSX) / Task Manager (Windows) to monitor CPU / Memory usage.