- Part of the base packages in R
- Tools for forking R processes (some functions do not work on Windows)

Core functions:

- `detectCores`
- `pvec`
- `mclapply`
- `mcparallel` & `mccollect`
Surprisingly, detects the number of cores of the current system.
```r
detectCores()

## [1] 24
```
Parallelization of a vectorized function call
```r
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed
##  0.214   0.029   0.243

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed
##  0.442   0.185   0.631

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed
##  0.532   0.389   0.372
```
```r
sapply(
  c(1, 2, 4, 8, 16, 24),
  function(x) {
    sapply(6:9, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
)

##           [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
## elapsed  0.079  0.063  0.055  0.058  0.084  0.092
## elapsed  0.334  0.442  0.393  0.350  0.530  0.623
## elapsed  2.159  4.508  3.761  3.456  3.633  3.741
## elapsed 18.562 24.887 39.576 33.213 33.622 34.600
```
Parallelized version of lapply
```r
system.time(rnorm(1e6))

##   user  system elapsed
##  0.101   0.007   0.107

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed
##  0.148   0.136   0.106

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed
##  0.242   0.061   0.052

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed
##  0.097   0.047   0.079

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 5)))

##   user  system elapsed
##  0.110   0.102   0.072

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed
##  0.193   0.076   0.040

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed
##  0.162   0.083   0.041

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed
##  0.098   0.065   0.037
```
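One caveat worth flagging, since the examples above draw random numbers in forked children: plain `set.seed` alone typically does not make `mclapply` reproducible. A minimal sketch, assuming the L'Ecuyer-CMRG generator that base R provides for exactly this purpose:

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG designed to supply multiple streams
set.seed(1)
res1 = mclapply(1:4, function(x) rnorm(2), mc.cores = 2)

set.seed(1)
res2 = mclapply(1:4, function(x) rnorm(2), mc.cores = 2)

# With the same seed and the same mc.cores, the child streams
# (and therefore the results) are reproducible.
identical(res1, res2)  # TRUE
```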
Asynchronous evaluation of an R expression in a separate process
```r
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6, 1, 1))
o = mcparallel(rgamma(1e6, 1, 1))

str(m)

## List of 2
##  $ pid: int 49291
##  $ fd : int [1:2] 5 8
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

str(n)

## List of 2
##  $ pid: int 49292
##  $ fd : int [1:2] 6 10
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```
Checks `mcparallel` objects for completion
```r
str(mccollect(list(m, n, o)))

## List of 3
##  $ 49291: num [1:1000000] -0.487 1.206 -1.9 -0.333 1.357 ...
##  $ 49292: num [1:1000000] 0.51 0.667 0.762 0.38 0.577 ...
##  $ 49293: num [1:1000000] 1.282 0.467 1.763 0.117 0.629 ...
```
```r
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10)  # will retrieve the result (since it's fast)

## [[1]]
## [1] 0.0008064199

mccollect(p, wait = FALSE)      # will signal the job as terminating

## [[1]]
## NULL

mccollect(p, wait = FALSE)      # there is no longer such a job

## NULL
```
Packages by Revolution Analytics that provide the `foreach` function, which is a parallelizable `for` loop (and then some).

Core functions:

- `registerDoMC`
- `foreach`, `%dopar%`, `%do%`
`registerDoMC` is primarily used to set the number of cores used by `foreach`; by default it uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.
options("cores") ## $cores ## NULL detectCores() ## [1] 24 getDoParWorkers() ## [1] 1 registerDoMC(4) getDoParWorkers() ## [1] 4
A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combined with `%do%` or `%dopar%` for single- or multi-core execution.
```r
for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)

## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
```
`foreach` can iterate across more than one value; note that it does not recycle: iteration stops with the shortest index.
```r
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2 + j^2)

## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
##
## [[3]]
## [1] 4.242641
##
## [[4]]
## [1] 5.656854
##
## [[5]]
## [1] 7.071068
```
```r
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2 + j^2)

## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
```
By default `foreach` returns a list; the `.combine` argument controls how the results are merged:

```r
foreach(i = 1:5, .combine = 'c') %do% sqrt(i)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

foreach(i = 1:5, .combine = 'cbind') %do% sqrt(i)

##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068

foreach(i = 1:5, .combine = '+') %do% sqrt(i)

## [1] 8.382332
```
Swapping out `%do%` for `%dopar%` will use the registered parallel backend.
```r
registerDoMC(4)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  0.886   0.075   0.431

registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  1.147   0.108   0.596

registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  1.571   0.159   0.449
```
- Optimal use of multiple cores is hard; there isn't one best solution
- Don't underestimate the overhead cost
- More art than science - experimentation is key
- Measure it or it didn't happen
- Be aware of the trade-off between developer time and run time
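To make the overhead point concrete, here is a minimal sketch of the kind of measurement meant above; exact timings will vary by machine:

```r
library(parallel)

# Cheap per-element work: fork/communication overhead dominates,
# and the parallel version may well lose to the serial one.
system.time(lapply(1:1e4, sqrt))
system.time(mclapply(1:1e4, sqrt, mc.cores = 4))

# Expensive per-element work: the overhead is amortized and the
# extra cores have a chance to pay off.
f = function(x) mean(rnorm(1e6))
system.time(lapply(1:20, f))
system.time(mclapply(1:20, f, mc.cores = 4))
```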
An awful lot of statistics is at its core linear algebra.
For example, least squares regression:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
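As a concrete illustration (example data hypothetical), computing \(\hat{\beta}\) directly in R and checking it against R's QR-based fit:

```r
set.seed(1)
X = cbind(1, matrix(rnorm(100 * 2), ncol = 2))  # design matrix with intercept
y = rnorm(100)

# (X'X)^{-1} X'y via the normal equations
beta_hat = solve(crossprod(X), crossprod(X, y))

# Should agree with R's QR-based least squares
coef(lm.fit(X, y))
```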
Principal component analysis:

- Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
- Often solved via SVD - let \(X = U\Sigma W^T\), then \(T = U\Sigma\).
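A small sketch of the SVD route (example data hypothetical; \(X\) is assumed column-centered):

```r
set.seed(1)
X = scale(matrix(rnorm(100 * 5), ncol = 5), center = TRUE, scale = FALSE)

s = svd(X)               # X = U Sigma W^T
T1 = X %*% s$v           # T = X W
T2 = s$u %*% diag(s$d)   # T = U Sigma
all.equal(T1, T2)        # TRUE, up to floating point error
```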
These problems are not unique to statistics; they come up across all areas of numerical computing.

Efficiency and stability of numerical algorithms matter.
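One sketch of why stability matters: forming \(X^TX\) roughly squares the condition number, which is one reason R's `lm` works from a QR decomposition of \(X\) rather than the normal equations. (Example data hypothetical.)

```r
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 1e-6)  # nearly collinear columns
X = cbind(x1, x2)

kappa(X)             # condition number estimate of X
kappa(crossprod(X))  # roughly kappa(X)^2 - much worse conditioned
```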
Low-level algorithms for common linear algebra operations:

BLAS - Basic Linear Algebra Subprograms

- Copying, scaling, and multiplying vectors and matrices
- Origins go back to 1979, written in Fortran

LAPACK - Linear Algebra Package

- Higher-level functionality building on BLAS
- Linear solvers, eigenvalues, and matrix decompositions
- Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
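Much of this is already under your fingertips in R; a sketch of everyday base functions that, in a stock R build, are thin wrappers over LAPACK routines:

```r
A = crossprod(matrix(rnorm(9), 3, 3))  # a symmetric positive definite matrix
b = rnorm(3)

solve(A, b)  # linear solver (LAPACK dgesv)
eigen(A)     # eigenvalues and eigenvectors
svd(A)       # singular value decomposition
chol(A)      # Cholesky factorization (LAPACK dpotrf)
```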
```shell
cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R
    linux-vdso.so.1 =>  (0x00007fff922a3000)
    libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
    libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
    libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
    libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
    libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
    libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
    libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
    libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
    libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)
```
Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated:

- Designed for a single-core CPU
- Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
Multithreaded alternatives (see the sketch below for controlling their thread counts at runtime):

- ATLAS - Automatically Tuned Linear Algebra Software
- OpenBLAS - fork of GotoBLAS from TACC at UTexas
- Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
- cuBLAS / Magma - hybrid CPU / GPU library from the ICL at UTK
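The benchmarks later in this section use an `OpenBlasThreads` package to set the thread count; if you don't have that available, the CRAN package RhpcBLASctl offers similar runtime control. A sketch, assuming RhpcBLASctl is installed:

```r
library(RhpcBLASctl)

blas_get_num_procs()     # threads the loaded BLAS is currently using
blas_set_num_threads(4)  # ask the BLAS for 4 threads
```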
BLAS and LAPACK subroutines are named using the form `pmmaaa` where:

- `p` is a one letter code for the type of data:
    - `S` - single precision floating point
    - `D` - double precision floating point
    - `C` - complex single precision floating point
    - `Z` - complex double precision floating point
- `mm` is a two letter code for the type of matrix expected by the subroutine
- `aaa` is a one to three letter code denoting the algorithm implemented by the subroutine
Example - DGEMM: `D` - type double, `GE` - general matrix, `MM` - matrix / matrix multiplication.
```fortran
dgemm( character TRANSA,
       character TRANSB,
       integer M,
       integer N,
       integer K,
       double precision ALPHA,
       double precision, dimension(lda,*) A,
       integer LDA,
       double precision, dimension(ldb,*) B,
       integer LDB,
       double precision BETA,
       double precision, dimension(ldc,*) C,
       integer LDC )
```
`DGEMM` performs one of the matrix-matrix operations
\[C = \alpha op( A ) \times op( B ) + \beta C\]
where \(op( X )\) is either \(op( X ) = X\) or \(op( X ) = X^T\), \(\alpha\) and \(\beta\) are scalars, and \(A\), \(B\) and \(C\) are matrices, with \(op( A )\) an \(m\) by \(k\) matrix, \(op( B )\) a \(k\) by \(n\) matrix and \(C\) an \(m\) by \(n\) matrix.
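In R terms, `%*%` on real matrices ultimately calls `dgemm`; the following sketch (hypothetical data) spells out the operation above:

```r
alpha = 2; beta = 0.5
A = matrix(rnorm(6), 2, 3)   # op(A): m x k
B = matrix(rnorm(12), 3, 4)  # op(B): k x n
C = matrix(rnorm(8), 2, 4)   # C: m x n

# C <- alpha * op(A) op(B) + beta * C, with op(X) = X here
alpha * A %*% B + beta * C
```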
DPOTRF: `D` - type double, `PO` - positive definite matrix, `TRF` - triangular factorization
```fortran
dpotrf( character UPLO,
        integer N,
        double precision, dimension( lda, * ) A,
        integer LDA,
        integer INFO )
```
`DPOTRF` computes the Cholesky factorization of a real symmetric positive definite matrix \(A\). The factorization has the form \[A = U^T U, \text{ if UPLO = 'U', or}\] \[A = L L^T, \text{ if UPLO = 'L',}\] where \(U\) is an upper triangular matrix and \(L\) is lower triangular.
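R's `chol` uses `dpotrf` with UPLO = 'U', so the returned factor satisfies \(A = U^TU\); a quick sketch (hypothetical data):

```r
A = crossprod(matrix(rnorm(16), 4, 4)) + diag(4)  # positive definite
U = chol(A)                                       # upper triangular factor

all.equal(crossprod(U), A)  # TRUE: t(U) %*% U reconstructs A
```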
```r
library(OpenBlasThreads)

x = matrix(runif(5000^2), ncol = 5000)
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(y %*% y)[3]
  })
})
```
Elapsed time (seconds) for `y %*% y`:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  |  0.001 |   0.001 |   0.000 |   0.000 |
| 500  |  0.018 |   0.011 |   0.008 |   0.008 |
| 1000 |  0.128 |   0.068 |   0.041 |   0.036 |
| 2000 |  0.930 |   0.491 |   0.276 |   0.162 |
| 3000 |  3.112 |   1.604 |   0.897 |   0.489 |
| 4000 |  7.330 |   3.732 |   1.973 |   1.188 |
| 5000 | 14.223 |   7.341 |   3.856 |   2.310 |
Base BLAS takes ~91 secs for a 5000 x 5000 matrix.
```r
library(OpenBlasThreads)
library(clusterGeneration)

x = genPositiveDefMat(5000)$Sigma
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(chol(y))[3]
  })
})
```
Elapsed time (seconds) for `chol(y)`:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  |  0.000 |   0.001 |   0.000 |   0.000 |
| 500  |  0.006 |   0.004 |   0.003 |   0.002 |
| 1000 |  0.031 |   0.023 |   0.013 |   0.010 |
| 2000 |  0.211 |   0.135 |   0.074 |   0.049 |
| 3000 |  0.666 |   0.401 |   0.212 |   0.157 |
| 4000 |  1.530 |   0.903 |   0.460 |   0.283 |
| 5000 |  2.593 |   1.749 |   0.899 |   0.565 |
Base BLAS takes ~26 secs for a 5000 x 5000 matrix.