Profiling & Benchmarking

  • Improved performance comes from iteration and from learning the most common pitfalls

  • Don't sweat the small stuff - weigh coder time vs. run time vs. compute costs

  • Measure it, or it didn't happen

  • "Premature optimization is the root of all evil (or at least most of it) in programming." -Knuth

How do we measure?

Simplest tool is R's base system.time which can be used to wrap any other call or calls.

system.time({rnorm(1e6)})
##    user  system elapsed 
##   0.115   0.004   0.121
system.time({rnorm(1e4) %*% t(rnorm(1e4))})
##    user  system elapsed 
##   0.493   0.257   0.608

Better benchmarking (pt. 1)

We can get better precision using the microbenchmark package

install.packages("microbenchmark")
library(microbenchmark)

d = abs(rnorm(1000))
r = microbenchmark(
      exp(log(d)/2),
      d^0.5,
      sqrt(d),
      times = 1000
    )
print(r)
## Unit: microseconds
##           expr    min      lq      mean median     uq     max neval
##  exp(log(d)/2) 16.193 19.0820 20.930554 19.494 20.772 119.023  1000
##          d^0.5 25.440 28.4740 30.441295 29.073 30.563 156.220  1000
##        sqrt(d)  2.455  4.9585  5.792436  5.450  6.088  67.758  1000

boxplot(r)

Better benchmarking (pt. 2)

We can also do better using the rbenchmark package

install.packages("rbenchmark")
library(rbenchmark)

d = abs(rnorm(1000))
benchmark(
  exp(log(d)/2),
  d^0.5,
  sqrt(d),
  replications = 1000,
  order = "relative"
)
##            test replications elapsed relative user.self sys.self user.child sys.child
## 3       sqrt(d)         1000   0.008    1.000     0.006    0.001          0         0
## 1 exp(log(d)/2)         1000   0.026    3.250     0.024    0.001          0         0
## 2         d^0.5         1000   0.037    4.625     0.035    0.000          0         0

Example 1

Earlier we mentioned that growing a vector as you collect results is bad - just how bad is it? Benchmark the following three functions and compare their performance (a possible setup is sketched after the code below).

good = function() {
    res = rep(NA, 1e4)
    for(i in seq_along(res))
        res[i] = sqrt(i)
}
bad = function() {
    res = numeric()
    for(i in 1:1e4)
        res = c(res, sqrt(i))
}
best = function() {
    sqrt(1:1e4)
}
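
One possible way to run this comparison, reusing microbenchmark from above (the number of replications here is an arbitrary choice, kept small because bad() is slow):

library(microbenchmark)

r = microbenchmark(
  good(),
  bad(),
  best(),
  times = 10   # bad() dominates the run time, so keep replications modest
)
print(r)
boxplot(r)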

Example 2

Let's compare looping vs. the map (apply) functions vs. dplyr.

  • First we will construct a large data frame
set.seed(523)
d = data_frame(
  A = rnorm(1e5),
  B = runif(1e5),
  C = rexp(1e5),
  D = sample(letters, 1e5, replace=TRUE)               
)
  • Implement functions that will return a data frame containing the maximum value of each column (one possible approach is sketched after this list)

    • A map function
    • A single for loop
    • dplyr

  • Benchmark all of your preceding functions using the data frame d. Which is the fastest? Why do you think this is the case? 10 replicates per function is sufficient.

  • What would happen if d were smaller? (e.g. take only the first 1000 rows)
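
One possible set of implementations and a benchmark, as a sketch (the function names max_map, max_loop, and max_dplyr are illustrative, not part of the exercise):

library(purrr)
library(dplyr)
library(rbenchmark)

max_map = function(df) {
  # map() computes max() for each column; as.data.frame() stitches the
  # resulting one-element-per-column list into a single row data frame
  as.data.frame(map(df, max), stringsAsFactors = FALSE)
}

max_loop = function(df) {
  res = vector("list", ncol(df))
  names(res) = names(df)
  for(i in seq_along(df))
    res[[i]] = max(df[[i]])
  as.data.frame(res, stringsAsFactors = FALSE)
}

max_dplyr = function(df) {
  summarize_all(df, max)
}

benchmark(
  max_map(d),
  max_loop(d),
  max_dplyr(d),
  replications = 10,
  order = "relative"
)

# For the last question, repeat the benchmark with e.g. head(d, 1000)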

Parallelization

parallel

Part of the base packages in R

  • tools for the forking of R processes (some functions do not work on Windows)

  • Core functions:

    • detectCores

    • pvec

    • mclapply

    • mcparallel & mccollect

detectCores

Surprisingly, detects the number of cores of the current system.

detectCores()

## [1] 24

pvec

Parallelization of a vectorized function call

system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372 

cores = c(1,2,4,8,16)
order = 6:8
res = map(
  cores, 
  function(x) {
     map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores=x))[3]) 
  }
) %>% do.call(rbind,.)
rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)
res

##            10^6  10^7  10^8
##  1 cores  0.016 0.282 1.489
##  2 cores  0.070 0.526 5.198
##  4 cores  0.052 0.430 5.023
##  8 cores  0.054 0.376 4.098
##  16 cores 0.073 0.401 4.049

mclapply

Parallelized version of lapply

system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037 

mcparallel

Asynchronous evaluation of an R expression in a separate process

m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)
## List of 2
##  $ pid: int 17047
##  $ fd : int [1:2] 4 7
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
str(n)
## List of 2
##  $ pid: int 17048
##  $ fd : int [1:2] 5 9
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

mccollect

Collects results from mcparallel jobs (by default waiting for them to complete)

str(mccollect(list(m,n,o)))
## List of 3
##  $ 17047: num [1:1000000] -0.95628 -0.00722 -0.88369 -1.4537 1.69245 ...
##  $ 17048: num [1:1000000] 0.127 0.0341 0.7402 0.3991 0.853 ...
##  $ 17049: num [1:1000000] 0.214 4.089 1.541 1.017 1.957 ...

mccollect - waiting

p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
## $`17050`
## [1] 0.000962806
mccollect(p, wait = FALSE)     # will signal the job as terminating
## $`17050`
## NULL
mccollect(p, wait = FALSE)     # there is no longer such a job
## NULL

doMC & foreach

Packages by Revolution Analytics that provide the foreach function, which is a parallelizable for loop (and then some).

  • Core functions:

    • registerDoMC

    • foreach, %dopar%, %do%

registerDoMC

Primarily used to set the number of cores used by foreach; by default it uses options("cores") or half the number of cores found by detectCores from the parallel package.

options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4

foreach

A slightly more powerful version of base for loops (think for with an lapply flavor). Combined with %do% or %dopar% for single or multicore execution.

for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)   
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068

foreach - iterators

foreach can iterate across more than one value, but it doesn't do length coercion (recycling) - it stops with the shortest input

foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)   
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 4.242641
## 
## [[4]]
## [1] 5.656854
## 
## [[5]]
## [1] 7.071068
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)   
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427


foreach - combining results

foreach(i = 1:5, .combine='c') %do% sqrt(i)   
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)   
##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068
foreach(i = 1:5, .combine='+') %do% sqrt(i)   
## [1] 8.382332

foreach - parallelization

Swapping out %do% for %dopar% will use the parallel backend.

registerDoMC(4)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   0.725   0.103   0.576
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   1.178   0.147   0.546
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   1.120   0.141   0.394

Example 3 - Bootstrapping

Bootstrapping is a resampling scheme in which the original data is repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and that sample is used to conduct whatever analysis procedure is of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.

set.seed(3212016)
d = data.frame(x = 1:120) %>%
    mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))

l = loess(y ~ x, data=d)
d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l,se=TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
ggplot(d, aes(x,y)) +
  geom_point(color="darkgrey") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()

Example 3 - Cont.

We will now re-implement the code below using one of the parallelization techniques we have just discussed, and then check performance with 1, 2, and 4 cores (a sketch of one possible approach follows the serial version).

n_rep = 5000
d_xy = select(d, x, y)

res = map(1:n_rep, function(i) {
  d_xy %>% 
    select(x,y) %>% 
    sample_n(nrow(d), replace=TRUE) %>%
    loess(y ~ x, data=.) %>%
    predict(newdata=d) %>%
    setNames(NULL)
}) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE),
  bs_high  = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)
)

ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_ribbon(aes(ymin=bs_low, ymax=bs_high), fill="blue", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()
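
A minimal sketch of one way to parallelize this, swapping map() for mclapply() from parallel. It assumes the objects d, d_xy, and n_rep defined in the serial version above; bs_loess is a hypothetical helper name, and the mc.cores values mirror the 1, 2, and 4 cores mentioned earlier:

library(parallel)

bs_loess = function(i) {
  d_xy %>%
    sample_n(nrow(d), replace=TRUE) %>%
    loess(y ~ x, data=.) %>%
    predict(newdata=d) %>%
    setNames(NULL)
}

# bootstrap predictions computed across 4 forked processes
# (results differ from the serial run only through the RNG streams)
res = mclapply(1:n_rep, bs_loess, mc.cores = 4) %>% do.call(cbind, .)

# timing comparison across core counts
system.time(mclapply(1:n_rep, bs_loess, mc.cores = 1))
system.time(mclapply(1:n_rep, bs_loess, mc.cores = 2))
system.time(mclapply(1:n_rep, bs_loess, mc.cores = 4))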

What to use when?

Optimal use of multiple cores is hard - there isn't one best solution

  • Don't underestimate the overhead cost

  • More art than science - experimentation is key

  • Measure it or it didn't happen

  • Be aware of the trade off between developer time and run time

BLAS and LAPACK

Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.


For example:

  • Linear regression models - find

\[ \hat{\beta} = (X^T X)^{-1} X^T y \]

  • Principal component analysis

    • Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).

    • Often solved via the SVD - let \(X = U\Sigma W^T\), then \(T = U\Sigma\) (both computations are sketched in R after this list).
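
As a small illustration with simulated data (in practice lm() and prcomp() wrap these computations with more numerically careful implementations), both formulas map directly onto base R functions:

set.seed(1)
n = 100
X = cbind(1, rnorm(n), rnorm(n))      # design matrix with an intercept
y = X %*% c(2, -1, 0.5) + rnorm(n)

# beta_hat = (X^T X)^{-1} X^T y, written as a linear solve
# rather than forming the inverse explicitly
beta_hat = solve(crossprod(X), crossprod(X, y))

# PCA via the SVD: X = U Sigma W^T, scores T = U Sigma
Xc = scale(X[,-1], center = TRUE, scale = FALSE)  # center the columns
s  = svd(Xc)
scores = s$u %*% diag(s$d)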

Numerical Linear Algebra

Not unique to statistics - these are the types of problems that come up across all areas of numerical computing.

  • Numerical linear algebra \(\ne\) mathematical linear algebra


  • Efficiency and stability of numerical algorithms matter

    • Designing and implementing these algorithms is hard


  • Don't reinvent the wheel - use the common core linear algebra tools, which have a well-defined API

BLAS and LAPACK

Low level algorithms for common linear algebra operations


BLAS

  • Basic Linear Algebra Subprograms

  • Copying, scaling, multiplying vectors and matrices

  • Origins go back to 1979, written in Fortran


LAPACK

  • Linear Algebra Package

  • Higher level functionality building on BLAS.

  • Linear solvers, eigenvalues, and matrix decompositions

  • Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
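
You rarely call BLAS or LAPACK directly from R - many everyday operations dispatch to them under the hood. A small illustration (the routine names in the comments are the standard ones these operations typically map to, not anything R prints):

A = matrix(rnorm(1000 * 1000), ncol = 1000)
B = matrix(rnorm(1000 * 1000), ncol = 1000)
b = rnorm(1000)

C = A %*% B               # matrix multiply      -> BLAS (dgemm)
x = solve(A, b)           # linear solve         -> LAPACK (dgesv)
e = eigen(crossprod(A))   # eigendecomposition   -> LAPACK
U = chol(crossprod(A))    # Cholesky factor      -> LAPACK (dpotrf)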

Modern variants?

Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated

  • Written in Fortran and designed for a single CPU core

  • Certain (potentially non-optimal) hard-coded defaults (e.g. block size).


Multithreaded alternatives:

  • ATLAS - Automatically Tuned Linear Algebra Software

  • OpenBLAS - fork of GotoBLAS from TACC at UTexas

  • Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools

  • cuBLAS / Magma - GPU libraries from NVidia and UTK respectively

OpenBLAS DGEMM (Matrix Multiply) Performance

    n   1 core  2 cores  4 cores  8 cores
  100    0.001    0.001    0.000    0.000
  500    0.018    0.011    0.008    0.008
 1000    0.128    0.068    0.041    0.036
 2000    0.930    0.491    0.276    0.162
 3000    3.112    1.604    0.897    0.489
 4000    7.330    3.732    1.973    1.188
 5000   14.223    7.341    3.856    2.310
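
A rough sketch of how timings like these might be produced in R (matrix sizes are illustrative; which BLAS is used and how many threads it runs are configured outside of R, e.g. by swapping the BLAS library R links against or via environment variables such as OPENBLAS_NUM_THREADS):

dgemm_time = function(n) {
  A = matrix(rnorm(n * n), ncol = n)
  B = matrix(rnorm(n * n), ncol = n)
  system.time(A %*% B)[3]   # elapsed time; %*% dispatches to the BLAS dgemm
}

sapply(c(100, 500, 1000, 2000), dgemm_time)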

GPU vs OpenBLAS (NVidia K20X - Kepler)

     M      N      K   MAGMA (ms)   cuBLAS (ms)    CPU (ms)
  1088   1088   1088         4.68          5.98       69.54
  2112   2112   2112        29.79         17.33      329.42
  3136   3136   3136        98.68         54.17      958.82
  4160   4160   4160       230.35        125.54     2119.71
  5184   5184   5184       448.66        240.91     3998.74
  6208   6208   6208       772.88        412.61     6801.51
  7232   7232   7232      1224.33        650.76    10629.33
  8256   8256   8256      1823.29        967.20    15740.96
  9280   9280   9280      2590.17       1370.57    22367.01
 10304  10304  10304      3517.43       1872.64    30597.85

GPU vs OpenBLAS (NVidia P100 - Pascal)

     M      N      K   MAGMA (ms)   cuBLAS (ms)    CPU (ms)
  1088   1088   1088         0.94          0.72       64.82
  2112   2112   2112         5.60          4.71      154.75
  3136   3136   3136        17.71         13.82      357.92
  4160   4160   4160        41.45         32.43      741.42
  5184   5184   5184        80.21         62.28     1293.62
  6208   6208   6208       138.34        106.78     2223.03
  7232   7232   7232       218.60        167.36     3413.69
  8256   8256   8256       326.42        249.71     4892.46
  9280   9280   9280       465.48        353.87     6999.95
 10304  10304  10304       638.43        483.86     9547.43