March 30, 2017
The slides, data, and all source files for this presentation can be found at
The following are packages we will be using this evening:
dplyr
ggplot2
readr
data.table
pryr
microbenchmark
rbenchmark
profvis
Generally whenever I use a package I will load it before the relevant code.
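If any of these are missing, they can all be installed from CRAN in one go (a quick sketch; package names taken from the list above):

install.packages(c("dplyr", "ggplot2", "readr", "data.table",
                   "pryr", "microbenchmark", "rbenchmark", "profvis"))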
file.info(list.files("data",full.names = TRUE))
##                        size isdir mode               mtime               ctime
## data/2008.csv     689413344 FALSE  644 2017-03-30 11:50:49 2017-03-30 11:50:49
## data/2008.csv.bz2 113753229 FALSE  644 2017-03-30 11:38:00 2017-03-30 11:38:04
##                                 atime uid gid  uname grname
## data/2008.csv     2017-03-30 16:56:58 502  20 rundel  staff
## data/2008.csv.bz2 2017-03-30 13:53:37 502  20 rundel  staff
system.time({flights = read.csv("data/2008.csv")}) ## user system elapsed ## 89.693 4.299 95.563
system.time({flights = read.csv("data/2008.csv.bz2")}) ## user system elapsed ## 179.085 4.487 186.402
system.time({flights = read.csv("data/2008.csv", comment.char="", stringsAsFactors=FALSE)}) ## user system elapsed ## 85.795 3.545 90.966
library(readr)
system.time({flights = read_csv("data/2008.csv.bz2")})
## |==================================================================================| 100% 657 MB
##    user  system elapsed
##  56.820  13.262  74.163
library(data.table)
system.time({flights = fread("data/2008.csv", showProgress = FALSE)})
##   user  system elapsed
##  7.359   0.854   9.755
fread
Some people like data.table, others do not. If you don't intend to use data.table functionality after using fread, it is a good idea to fix the class of your object after reading it in.
class(flights)
## [1] "data.table" "data.frame"
class(flights) = "data.frame"
class(flights)
## [1] "data.frame"
A lot of the difficulty and slowness of reading CSV files comes from trying to figure out what type each column should have. We can save ourselves this trouble later on by saving our data as a binary Rdata file.
save(flights, file="data/flights.Rdata")
system.time(load(file="data/flights.Rdata"))
##    user  system elapsed
##  10.563   0.399  11.027
| Task | Timing |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Read 1 MB sequentially from memory | 250,000 ns |
| Read 1 MB sequentially from SSD | 1,000,000 ns |
| Disk seek | 10,000,000 ns |
| Read 1 MB sequentially from network | 10,000,000 ns |
| Read 1 MB sequentially from spinning disk | 30,000,000 ns |
| Send packet CA -> Netherlands -> CA | 150,000,000 ns |
Let's imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criterion. This requires a sequential read across the entire data set.
If we can store the file in memory: ~10,000 MB × 250,000 ns/MB ≈ 2.5 seconds
If we have to access the file from an SSD: ~10,000 MB × 1,000,000 ns/MB ≈ 10 seconds
If we have to access the file from a spinning disk: ~10,000 MB × 30,000,000 ns/MB ≈ 5 minutes
This is just for reading the data; if we make any modifications (writing), things are much worse. This is also incredibly optimistic, since it assumes that all the data on disk can be read in one continuous read.
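As a quick back-of-the-envelope check of the numbers above (treating 10 GB as roughly 10,000 MB and using the per-MB latencies from the table):

size_mb = 10 * 1024                 # ~10 GB expressed in MB

mem_ns  = size_mb * 250e3           # memory:         250,000 ns per MB
ssd_ns  = size_mb * 1e6             # SSD:          1,000,000 ns per MB
disk_ns = size_mb * 30e6            # spinning disk: 30,000,000 ns per MB

c(memory = mem_ns, ssd = ssd_ns, disk = disk_ns) / 1e9   # convert ns to seconds
## memory    ssd   disk
##   2.56  10.24 307.20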
Databases (e.g. MySQL, Oracle, etc.) were developed to solve exactly this problem: what do you do when data size >> available memory?
DataFest data is generally on the scale of 1-2 GB
Small enough to fit in memory on most laptops
Large enough that you need to be careful (many operations will create hidden copies)
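As a rough sketch of what a "hidden copy" looks like, base R's tracemem() reports whenever an object is duplicated (the addresses printed will differ from run to run):

x = data.frame(a = rnorm(1e6))
tracemem(x)      # report every time x gets copied

x$b = x$a * 2    # modifying the data frame duplicates it,
                 # temporarily doubling its memory footprint

untracemem(x)    # stop tracking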
A simple but useful trick is to only work with the full data set when you absolutely have to. We are generally interested in the large scale patterns / features of the data, and these are generally preserved by randomly sampling the data (this keeps statisticians employed).
library(dplyr)
library(pryr)

set.seed(20170330)
flights_sub = flights %>% sample_frac(0.1)
object.size(flights)
## 953617184 bytes
object_size(flights)
## 954 MB
object.size(flights_sub)
## 98427752 bytes
object_size(flights_sub)
## 98.4 MB
If you are not using the original object, it is a good idea to delete it to free up memory (you can always read it back in again later).
rm(flights)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  1062473  56.8    1770749   94.6   1770749   94.6
## Vcells 14146350 108.0  258829281 1974.8 577295819 4404.5
Generally we want to avoid for loops, but that isn't always possible. For slow loops it is a good idea to track our progress (and know how much time is left).
library(dplyr)

p = progress_estimated(50, min_time = 0)
for(i in 1:50) {
  # Calculate something complicated
  Sys.sleep(0.1)

  p$tick()$print()
}
The simplest tool is base R's system.time, which can be used to wrap any other call or calls.
system.time(rnorm(1e6))
##   user  system elapsed
##  0.095   0.002   0.106
system.time(rnorm(1e4) %*% t(rnorm(1e4)))
##   user  system elapsed
##  0.566   0.321   1.001
microbenchmark
We can do better (higher precision) using the microbenchmark package.
library(microbenchmark)

d = abs(rnorm(1000))
r = microbenchmark(
  exp(log(d)/2),
  d^0.5,
  sqrt(d),
  times = 1000
)
print(r)
## Unit: microseconds
##           expr    min     lq      mean  median     uq      max neval
##  exp(log(d)/2) 14.848 15.439 20.004378 17.9640 19.167  219.591  1000
##          d^0.5 24.195 24.582 32.358537 27.2815 28.820 2003.781  1000
##        sqrt(d)  2.337  2.775  5.910073  5.1995  5.718  894.631  1000
boxplot(r)
rbenchmark
We can also use the rbenchmark package.
library(rbenchmark)

d = abs(rnorm(1000))
benchmark(
  exp(log(d)/2),
  d^0.5,
  sqrt(d),
  replications = 1000,
  order = "relative"
)
##            test replications elapsed relative user.self sys.self user.child sys.child
## 3       sqrt(d)         1000   0.004     1.00     0.004    0.000          0         0
## 1 exp(log(d)/2)         1000   0.025     6.25     0.023    0.000          0         0
## 2         d^0.5         1000   0.029     7.25     0.029    0.001          0         0
library(profvis)

set.seed(20170330)
flights_small = flights %>% sample_n(100000)
profvis({
  m = lm(AirTime ~ Distance, data = flights_small)
  plot(AirTime ~ Distance, data = flights_small)
  abline(m, col = "red")
})
Sample, sample, sample
Save your code and important data objects regularly - bigger data can and will kill your R session
If something is taking a long time - kill it and figure out why (or at least how long it will take)
Vectorized >> *apply / map* >> for / while
Make use of Activity Monitor (OSX) / Task Manager (Windows) to monitor CPU / Memory usage.
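From inside R itself, pryr's mem_used() gives a quick (approximate) read on how much memory your objects are currently occupying:

library(pryr)
mem_used()   # total size of all objects in the current R session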
for_grow = function(x) {
  res = c()
  for(x_i in x)
    res = c(res, sqrt(x_i))
  res
}

for_init = function(x) {
  res = rep(NA, length(x))
  for(i in seq_along(x))
    res[i] = sqrt(x[i])
  res
}

vect = function(x) {
  sqrt(x)
}

dplyr_mutate = function(x) {
  data.frame(x = x) %>% mutate(s = sqrt(x))
}

dplyr_transmute = function(x) {
  data.frame(x = x) %>% transmute(s = sqrt(x))
}
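The advice above also mentions the *apply and map* families; as a sketch (not part of the original comparison, and assuming purrr is installed), two more variants could be defined and passed to the benchmark() calls below:

library(purrr)   # for map_dbl()

sapply_fun = function(x) {
  sapply(x, sqrt)
}

map_fun = function(x) {
  map_dbl(x, sqrt)
}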
library(rbenchmark)

d = abs(rnorm(1000))
benchmark(
  for_grow(d), for_init(d), vect(d),
  dplyr_mutate(d), dplyr_transmute(d),
  replications = 100, relative = "elapsed"
) %>%
  arrange(elapsed) %>%
  select(test:relative)
##                 test replications elapsed relative
## 1            vect(d)          100   0.001        1
## 2    dplyr_mutate(d)          100   0.077       77
## 3        for_init(d)          100   0.097       97
## 4 dplyr_transmute(d)          100   0.099       99
## 5        for_grow(d)          100   0.290      290
d = abs(rnorm(10000))
benchmark(
  for_grow(d), for_init(d), vect(d),
  dplyr_mutate(d), dplyr_transmute(d),
  replications = 100, relative = "elapsed"
) %>%
  arrange(elapsed) %>%
  select(test:relative)
##                 test replications elapsed  relative
## 1            vect(d)          100   0.003     1.000
## 2    dplyr_mutate(d)          100   0.060    20.000
## 3 dplyr_transmute(d)          100   0.090    30.000
## 4        for_init(d)          100   0.862   287.333
## 5        for_grow(d)          100  36.920 12306.667
With large data sets, overplotting is often a serious issue.
plot(AirTime ~ Distance, data = flights_small)
This can be addressed with transparency (alpha), smaller plot characters, or binning.
plot(AirTime ~ Distance, data = flights_small, pch=16, col=adjustcolor("black",alpha.f=0.01), cex=0.5)
library(ggplot2)

ggplot(flights_small, aes(y=AirTime, x=Distance)) +
  geom_point(alpha=0.01, size=0.5)
ggplot(flights_small, aes(y=AirTime,x=Distance)) + geom_hex()
ggplot(flights_small, aes(y=AirTime,x=Distance)) + geom_hex(aes(fill=sqrt(..count..)))
ggplot(flights_small, aes(y=AirTime,x=Distance)) + geom_hex(aes(fill=log(..count..)))
ggplot(flights_small, aes(y=AirTime,x=Distance)) + geom_point(data=sample_n(flights,100)) + geom_density_2d(alpha=0.5)
PDF, SVG, and other vector format plots are useful because you can rescale and zoom without losing quality, but when there are many, many plot objects they can be painfully slow to render. For these cases consider creating a fixed resolution image based plot (e.g. png) instead.
png("time_vs_dist.png", width=1024, height=800) ggplot(flights_small, aes(y=AirTime,x=Distance)) + geom_point(alpha=0.01, size=0.5) dev.off()
The different parts of the raw data or any additional outside data you find may need to be merged together. This can be done with dplyr's join functions or base R's merge, but it is important to think about the nature of the relationship between the two data sets.
One to one - each entry in A corresponds to either 0 or 1 entries in B.
One to many - each entry in A corresponds to 0 or more entries in B, or vice versa.
Many to many - each entry in A corresponds to 0 or more entries in B and each entry in B corresponds to 0 or more entries in A.
addr = data.frame(
  name  = c("Alice", "Alice", "Bob", "Bob"),
  email = c("alice@company.com", "alice@gmail.com", "bob@company.com", "bob@hotmail.com"),
  stringsAsFactors = FALSE
)

phone = data.frame(
  name  = c("Alice", "Alice", "Bob", "Bob"),
  phone = c("919 555-1111", "310 555-2222", "919 555-3333", "310 555-3333"),
  stringsAsFactors = FALSE
)

library(dplyr)
full_join(addr, phone, by="name")
##    name             email        phone
## 1 Alice alice@company.com 919 555-1111
## 2 Alice alice@company.com 310 555-2222
## 3 Alice   alice@gmail.com 919 555-1111
## 4 Alice   alice@gmail.com 310 555-2222
## 5   Bob   bob@company.com 919 555-3333
## 6   Bob   bob@company.com 310 555-3333
## 7   Bob   bob@hotmail.com 919 555-3333
## 8   Bob   bob@hotmail.com 310 555-3333
A = data.frame(common="C", A_values=1:1000)
B = data.frame(common="C", B_values=1:1000)

inner_join(A, B) %>% tbl_df()
## Joining, by = "common"
## # A tibble: 1,000,000 × 3
##    common A_values B_values
##    <fctr>    <int>    <int>
## 1       C        1        1
## 2       C        1        2
## 3       C        1        3
## 4       C        1        4
## 5       C        1        5
## 6       C        1        6
## 7       C        1        7
## 8       C        1        8
## 9       C        1        9
## 10      C        1       10
## # ... with 999,990 more rows
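One way to avoid an accidental many-to-many explosion like this is to check, before joining, whether the key uniquely identifies rows in each table; a minimal sketch using dplyr's count():

library(dplyr)

# Keys appearing more than once indicate a one-to-many or
# many-to-many relationship with respect to that table.
count(addr,  name) %>% filter(n > 1)
count(phone, name) %>% filter(n > 1)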
Some advice given by Mark Suchard at UCLA (unsure who said this originally):
Linear complexity \((\mathcal{O}(n))\) - Great: sqrt, lm
Quadratic complexity \((\mathcal{O}(n^2))\) - Pray: dist
Cubic complexity \((\mathcal{O}(n^3))\) - Give up: %*%, solve, chol
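To see these growth rates in practice, here is a rough sketch using rbenchmark from earlier (absolute timings will vary by machine):

library(rbenchmark)

# Compare how O(n), O(n^2), and O(n^3) operations scale as n doubles.
for (n in c(100, 200, 400)) {
  v = rnorm(n)                         # length-n vector
  d = matrix(rnorm(n * 2), ncol = 2)   # n points in 2 dimensions
  m = matrix(rnorm(n^2), ncol = n)     # n x n matrix

  print(benchmark(
    sqrt(v),   # linear:    O(n)
    dist(d),   # quadratic: O(n^2) pairwise distances
    solve(m),  # cubic:     O(n^3) matrix inversion
    replications = 10, order = "relative"
  ))
}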