Profiling & Parallelization


Profiling & Benchmarking

  • Improved performance comes from iteration and from learning the most common pitfalls

  • Don’t sweat the small stuff - Coder time vs Run time vs Compute costs

  • Measure it, or it didn’t happen

  • “Premature optimization is the root of all evil (or at least most of it) in programming.” -Knuth

How do we measure?

The simplest tool is base R’s system.time, which can be used to wrap any other call (or block of calls).

system.time(rnorm(1e6))
##    user  system elapsed 
##   0.134   0.004   0.138
system.time(rnorm(1e4) %*% t(rnorm(1e4)))
##    user  system elapsed 
##   0.643   0.278   0.627

Better benchmarking (pt. 1)

We can do better (more precise timing) using the microbenchmark package

install.packages("microbenchmark")
library(microbenchmark)

d = abs(rnorm(1000))
r = microbenchmark(
      exp(log(d)/2),
      d^0.5,
      sqrt(d),
      times = 1000
    )
print(r)
## Unit: microseconds
##           expr    min      lq      mean  median      uq     max neval
##  exp(log(d)/2) 19.402 22.0225 26.139719 23.3335 24.9005 125.384  1000
##          d^0.5 28.470 31.9075 37.092019 34.1910 35.3425 179.608  1000
##        sqrt(d)  3.018  6.1065  8.072531  6.5120  7.6350 141.061  1000

boxplot(r)

Better benchmarking (pt. 2)

We can also do better using the rbenchmark package

install.packages("rbenchmark")
library(rbenchmark)

d = abs(rnorm(1000))
benchmark(
  exp(log(d)/2),
  d^0.5,
  sqrt(d),
  replications = 1000,
  order = "relative"
)
##            test replications elapsed relative user.self sys.self user.child sys.child
## 3       sqrt(d)         1000   0.007    1.000     0.007    0.000          0         0
## 1 exp(log(d)/2)         1000   0.027    3.857     0.026    0.001          0         0
## 2         d^0.5         1000   0.034    4.857     0.034    0.001          0         0

Exercise 1

Earlier we mentioned that growing a vector as you collect results is bad; just how bad is it? Benchmark the following three functions and compare their performance (a minimal benchmark sketch follows the code).

good = function()
{
    # preallocate the full result vector, then fill it in place
    res = rep(NA, 1e4)
    for(i in seq_along(res))
    {
        res[i] = sqrt(i)
    }
    res
}
bad = function()
{
    # grow the result one element at a time - every c() copies the whole vector
    res = numeric()
    for(i in 1:1e4)
    {
        res = c(res,sqrt(i))
    }
    res
}
best = function()
{
    # fully vectorized - no explicit loop at all
    sqrt(1:1e4)
}
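
A minimal sketch of one way to run this comparison with microbenchmark (times is kept small here only because bad() is slow):

library(microbenchmark)

r = microbenchmark(
      good(),
      bad(),
      best(),
      times = 10
    )
print(r)
boxplot(r)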

Parallelization

parallel

Part of the base packages in R

  • tools for the forking of R processes (some functions do not work on Windows)

  • Core functions:

    • detectCores

    • pvec

    • mclapply

    • mcparallel & mccollect

detectCores

Surprisingly, detects the number of cores of the current system.

detectCores()

## [1] 24

pvec

Parallelization of a vectorized function call

system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372 

library(purrr) # map(), map_dbl(), and the %>% pipe

cores = c(1,2,4,8,16)
order = 6:8
res = map(
  cores, 
  function(x) {
     map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores=x))[3]) 
  }
) %>% do.call(rbind,.)
rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)
res

##            10^6  10^7  10^8
##  1 cores  0.010 0.081 1.216
##  2 cores  0.060 0.426 5.223
##  4 cores  0.086 0.366 3.935
##  8 cores  0.175 0.422 3.573
##  16 cores 0.285 0.550 3.465

mclapply

Parallelized version of lapply

system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037 

mcparallel

Asynchronous evaluation of an R expression in a separate process

m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)
## List of 2
##  $ pid: int 66439
##  $ fd : int [1:2] 5 8
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
str(n)
## List of 2
##  $ pid: int 66440
##  $ fd : int [1:2] 6 10
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

mccollect

Collects the results of mcparallel jobs (or, with wait = FALSE, checks whether they have completed)

str(mccollect(list(m,n,o)))
## List of 3
##  $ 66439: num [1:1000000] 0.9738 0.0506 -0.6864 -0.9432 0.4358 ...
##  $ 66440: num [1:1000000] 0.401 0.984 0.376 0.17 0.661 ...
##  $ 66441: num [1:1000000] 0.743 0.477 0.628 1.532 3.023 ...

mccollect - waiting

p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
## $`66442`
## [1] 0.001247375
mccollect(p, wait = FALSE)     # will signal the job as terminating
## $`66442`
## NULL
mccollect(p, wait = FALSE)     # there is no longer such a job
## NULL

doMC & foreach

Packages by Revolution Analytics that provide the foreach function, which is a parallelizable for loop (and then some).

  • Core functions:

    • registerDoMC

    • foreach, %dopar%, %do%

registerDoMC

Primarily used to set the number of cores used by foreach; by default it uses options("cores"), or half the number of cores found by detectCores from the parallel package.

options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4

foreach

A slightly more powerful version of base for loops (think for with an lapply flavor). Combine it with %do% or %dopar% for single core or multicore execution.

for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)   
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068

foreach - iterators

foreach can iterate across more than one value, and it stops once the shortest iterator is exhausted (see the second example below)

foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)   
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 4.242641
## 
## [[4]]
## [1] 5.656854
## 
## [[5]]
## [1] 7.071068
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)   
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427

foreach - combining results

foreach(i = 1:5, .combine='c') %do% sqrt(i)   
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)   
##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068
foreach(i = 1:5, .combine='+') %do% sqrt(i)   
## [1] 8.382332

foreach - parallelization

Swapping out %do% for %dopar% will use the parallel backend.

registerDoMC(4)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   0.438   0.038   0.340
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   1.444   0.122   0.345
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
##    user  system elapsed 
##   1.174   0.113   0.324

Exercise 2 - Bootstrapping

Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and using that to conduct whatever analysis procedure is of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.

library(dplyr)   # %>% and mutate()
library(ggplot2)

set.seed(3212016)
d = data.frame(x = 1:120) %>%
    mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))

l = loess(y ~ x, data=d)
d$pred_y = predict(l)
d$pred_y_se = predict(l,se=TRUE)$se.fit

ggplot(d, aes(x,y)) +
  geom_point() +
  geom_line(aes(y=pred_y)) +
  geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red")

Exercise 2 - Cont.

Re-implement the code below using one of the parallelization techniques we have just discussed, and check your performance in creating the bootstrap samples using 1, 2, and 4 cores. (Work with a neighbor so you are not all running multicore code at the same time.) One possible sketch follows the code.

n_rep = 10000
res = matrix(NA, ncol=n_rep, nrow=nrow(d))

for(i in 1:ncol(res))
{ 
  bootstrap_samp = d %>% select(x,y) %>% sample_n(nrow(d), replace=TRUE)
  res[,i] = predict(loess(y ~ x, data=bootstrap_samp), newdata=d)
}

# Calculate the 95% bootstrap prediction interval
d$bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE)
d$bs_up  = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)

ggplot(d, aes(x,y)) +
  geom_point() +
  geom_line(aes(y=pred_y)) +
  geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=bs_low), color="blue") +
  geom_line(aes(y=bs_up), color="blue")
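
One possible parallel sketch, using parallel::mclapply (other back ends such as foreach with %dopar% work just as well). It assumes d and n_rep from above; mc.cores = 4 is only for illustration - try 1, 2, and 4 as in the exercise:

library(parallel)
library(dplyr)

n_rep = 10000

# one bootstrap replicate: resample the data, refit loess, predict on the original x grid
bs_pred = function(i)
{
  bootstrap_samp = d %>% select(x,y) %>% sample_n(nrow(d), replace=TRUE)
  predict(loess(y ~ x, data=bootstrap_samp), newdata=d)
}

res = mclapply(1:n_rep, bs_pred, mc.cores = 4) %>% do.call(cbind, .)

# Calculate the 95% bootstrap prediction interval, as before
d$bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE)
d$bs_up  = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)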

What to use when?

Optimal use of multiple cores is hard; there isn’t one best solution

  • Don’t underestimate the overhead cost

  • More art than science - experimentation is key

  • Measure it or it didn’t happen

  • Be aware of the trade-off between developer time and run time

BLAS and LAPACK

Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.


For example:

  • Linear regression models: find

\[ \hat{\beta} = (X^T X)^{-1} X^Ty \]

  • Principal component analysis

    • Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).

    • Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).

Numerical Linear Algebra

Not unique to statistics; these are the types of problems that come up across all areas of numerical computing.

  • Numerical linear algebra \(\ne\) mathematical linear algebra


  • Efficiency and stability of numerical algorithms matter

    • Designing and implementing these algorithms is hard


  • Don’t reinvent the wheel - use the common core linear algebra tools (well-defined APIs)

BLAS and LAPACK

Low level algorithms for common linear algebra operations (a short R illustration follows below)


BLAS

  • Basic Linear Algebra Subprograms

  • Copying, scaling, multiplying vectors and matrices

  • Origins go back to 1979, written in Fortran


LAPACK

  • Linear Algebra Package

  • Higher level functionality building on BLAS.

  • Linear solvers, eigenvalues, and matrix decompositions

  • Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
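
In R you rarely call these libraries directly; they sit behind the usual matrix functions. A small illustration (the routine names in the comments are the standard double-precision routines, noted only for orientation):

X = matrix(rnorm(1000 * 100), ncol = 100)
y = rnorm(1000)

XtX = crossprod(X)                 # X^T X        - level 3 BLAS (dsyrk / dgemm)
b   = solve(XtX, crossprod(X, y))  # linear solve - LAPACK (dgesv)
R   = chol(XtX)                    # Cholesky     - LAPACK (dpotrf)
s   = svd(X)                       # SVD          - LAPACK (dgesdd)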

Modern variants?

Most default BLAS and LAPACK implementations (like R’s defaults) are somewhat dated (see below for how to check what your install uses)

  • Designed for a single CPU core

  • Certain (potentially non-optimal) hard-coded defaults (e.g. block size).


Multithreaded alternatives:

  • ATLAS - Automatically Tuned Linear Algebra Software

  • OpenBLAS - fork of GotoBLAS from TACC at UTexas

  • Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools

  • cuBLAS / Magma - hybrid CPU / GPU library from UTK
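
A quick way to check which BLAS and LAPACK your R installation is actually linked against (the output depends on how R was built; sessionInfo() reports the library paths in recent versions of R):

# BLAS / LAPACK shared libraries used by this session
sessionInfo()

# version string reported by the linked LAPACK
La_version()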

Naming conventions

BLAS and LAPACK subroutines are named using the form pmmaaa (a few decoded examples follow the list), where:

  • p is a one letter code for the type of data

    • S single precision floating point
    • D double precision floating point
    • C complex single precision floating point
    • Z complex double precision floating point
  • mm is a two letter code for the type of matrix expected by the subroutine

  • aaa is a one to three letter code denoting the algorithm implemented by the subroutine
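
For example, a few standard routine names decoded under this scheme (DGEMM is from BLAS; DPOTRF and ZGESVD are from LAPACK):

DGEMM   ->  D (double) + GE (general matrix)              + MM  (matrix-matrix multiply)
DPOTRF  ->  D (double) + PO (symmetric positive definite) + TRF (triangular factorization, i.e. Cholesky)
ZGESVD  ->  Z (complex double) + GE (general matrix)      + SVD (singular value decomposition)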

BLAS Example - DGEMM

D - type double, GE - general matrix, MM - matrix / matrix multiplication.

dgemm(   character   TRANSA,
         character   TRANSB,
         integer     M,
         integer     N,
         integer     K,
         double precision    ALPHA,
         double precision, dimension(lda,*)  A,
         integer     LDA,
         double precision, dimension(ldb,*)  B,
         integer     LDB,
         double precision    BETA,
         double precision, dimension(ldc,*)  C,
         integer     LDC 
     )   

DGEMM performs one of the matrix-matrix operations

\[C = \alpha op( A ) \times op( B ) + \beta C\]

where \(op( X )\) is either \(op( X ) = X\) or \(op( X ) = X^T\), \(\alpha\) and \(\beta\) are scalars, and \(A\), \(B\) and \(C\) are matrices, with \(op( A )\) an \(m\) by \(k\) matrix, \(op( B )\) a \(k\) by \(n\) matrix and \(C\) an \(m\) by \(n\) matrix.
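
In R terms, a single dgemm call computes roughly the following (a sketch; R’s %*% ultimately dispatches to dgemm for double-precision matrices, though it does not expose alpha or beta):

m = 3; k = 4; n = 2
A = matrix(rnorm(m * k), m, k)
B = matrix(rnorm(k * n), k, n)
C = matrix(rnorm(m * n), m, n)
alpha = 2
beta  = 0.5

# C <- alpha * op(A) %*% op(B) + beta * C, with op(X) = X here (TRANSA = TRANSB = 'N')
C_new = alpha * (A %*% B) + beta * C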

OpenBLAS DGEMM Performance

library(RhpcBLASctl)
x = matrix(runif(5000^2), ncol=5000)

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8)

sapply(
  cores, 
  function(n_cores) 
  {
    # limit the number of threads OpenBLAS is allowed to use
    blas_set_num_threads(n_cores)
    sapply(
      sizes, 
      function(s) 
      {
        # time an s x s matrix-matrix multiply (elapsed seconds)
        y = x[1:s,1:s]
        system.time(y %*% y)[3]
      }
    )
  }
)

Elapsed time in seconds for computing y %*% y with an n x n matrix y:

     n   1 core   2 cores   4 cores   8 cores
   100    0.001     0.001     0.000     0.000
   500    0.018     0.011     0.008     0.008
  1000    0.128     0.068     0.041     0.036
  2000    0.930     0.491     0.276     0.162
  3000    3.112     1.604     0.897     0.489
  4000    7.330     3.732     1.973     1.188
  5000   14.223     7.341     3.856     2.310

GPU vs OpenBLAS

     M      N      K   MAGMA Gflop/s (ms)   cuBLAS Gflop/s (ms)    CPU Gflop/s (ms)
  1088   1088   1088    550.51 (   4.68)     430.65 (   5.98)     37.04 (   69.54)
  2112   2112   2112    632.45 (  29.79)    1086.90 (  17.33)     57.20 (  329.42)
  3136   3136   3136    625.10 (  98.68)    1138.67 (  54.17)     64.33 (  958.82)
  4160   4160   4160    625.07 ( 230.35)    1146.94 ( 125.54)     67.93 ( 2119.71)
  5184   5184   5184    621.02 ( 448.66)    1156.58 ( 240.91)     69.68 ( 3998.74)
  6208   6208   6208    619.12 ( 772.88)    1159.71 ( 412.61)     70.35 ( 6801.51)
  7232   7232   7232    617.88 (1224.33)    1162.48 ( 650.76)     71.17 (10629.33)
  8256   8256   8256    617.28 (1823.29)    1163.65 ( 967.20)     71.50 (15740.96)
  9280   9280   9280    617.09 (2590.17)    1166.20 (1370.57)     71.46 (22367.01)
 10304  10304  10304    622.05 (3517.43)    1168.40 (1872.64)     71.51 (30597.85)