- Part of the base packages in R
- Tools for forking R processes (some functions do not work on Windows)

Core functions:

- `detectCores`
- `pvec`
- `mclapply`
- `mcparallel` & `mccollect`
Surprisingly, detects the number of cores of the current system.
```r
detectCores()

## [1] 24
```
Parallelization of a vectorized function call
```r
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed
##  0.214   0.029   0.243

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed
##  0.442   0.185   0.631

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed
##  0.532   0.389   0.372
```
```r
sapply(
  c(1, 2, 4, 8, 16, 24),
  function(x) {
    sapply(6:9, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
)

##           [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
## elapsed  0.079  0.063  0.055  0.058  0.084  0.092
## elapsed  0.334  0.442  0.393  0.350  0.530  0.623
## elapsed  2.159  4.508  3.761  3.456  3.633  3.741
## elapsed 18.562 24.887 39.576 33.213 33.622 34.600
```
Parallelized version of lapply
```r
system.time(rnorm(1e6))

##   user  system elapsed
##  0.101   0.007   0.107

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed
##  0.148   0.136   0.106

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed
##  0.242   0.061   0.052

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed
##  0.097   0.047   0.079

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 5)))

##   user  system elapsed
##  0.110   0.102   0.072

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed
##  0.193   0.076   0.040

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed
##  0.162   0.083   0.041

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed
##  0.098   0.065   0.037
```
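One caveat worth flagging, since the examples above draw random numbers in forked children: plain `set.seed` alone typically does not make `mclapply` reproducible. A minimal sketch, assuming the L'Ecuyer-CMRG generator that base R provides for exactly this purpose:

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG designed to supply multiple streams
set.seed(1)
res1 = mclapply(1:4, function(x) rnorm(2), mc.cores = 2)

set.seed(1)
res2 = mclapply(1:4, function(x) rnorm(2), mc.cores = 2)

# With the same seed and the same mc.cores, the child streams
# (and therefore the results) are reproducible.
identical(res1, res2)  # TRUE
```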
Asynchronous evaluation of an R expression in a separate process
```r
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6, 1, 1))
o = mcparallel(rgamma(1e6, 1, 1))

str(m)

## List of 2
##  $ pid: int 49291
##  $ fd : int [1:2] 5 8
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

str(n)

## List of 2
##  $ pid: int 49292
##  $ fd : int [1:2] 6 10
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```
Checks `mcparallel` objects for completion
```r
str(mccollect(list(m, n, o)))

## List of 3
##  $ 49291: num [1:1000000] -0.487 1.206 -1.9 -0.333 1.357 ...
##  $ 49292: num [1:1000000] 0.51 0.667 0.762 0.38 0.577 ...
##  $ 49293: num [1:1000000] 1.282 0.467 1.763 0.117 0.629 ...
```
```r
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10)  # will retrieve the result (since it's fast)

## [[1]]
## [1] 0.0008064199

mccollect(p, wait = FALSE)      # will signal the job as terminating

## [[1]]
## NULL

mccollect(p, wait = FALSE)      # there is no longer such a job

## NULL
```
Packages by Revolution Analytics that provide the `foreach` function, which is a parallelizable `for` loop (and then some).

Core functions:

- `registerDoMC`
- `foreach`, `%dopar%`, `%do%`
`registerDoMC` is primarily used to set the number of cores used by `foreach`; by default it uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.
options("cores") ## $cores ## NULL detectCores() ## [1] 24 getDoParWorkers() ## [1] 1 registerDoMC(4) getDoParWorkers() ## [1] 4
A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combined with `%do%` or `%dopar%` for single- or multi-core execution.
```r
for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)

## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
```
`foreach` can iterate across more than one value; note that it does not recycle: iteration stops with the shortest index.
```r
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2 + j^2)

## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
##
## [[3]]
## [1] 4.242641
##
## [[4]]
## [1] 5.656854
##
## [[5]]
## [1] 7.071068
```
```r
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2 + j^2)

## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
```
By default `foreach` returns a list; the `.combine` argument controls how the results are merged:

```r
foreach(i = 1:5, .combine = 'c') %do% sqrt(i)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

foreach(i = 1:5, .combine = 'cbind') %do% sqrt(i)

##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068

foreach(i = 1:5, .combine = '+') %do% sqrt(i)

## [1] 8.382332
```
Swapping out `%do%` for `%dopar%` will use the registered parallel backend.
```r
registerDoMC(4)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  0.886   0.075   0.431

registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  1.147   0.108   0.596

registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

##   user  system elapsed
##  1.571   0.159   0.449
```
- Optimal use of multiple cores is hard; there isn't one best solution
- Don't underestimate the overhead cost
- More art than science - experimentation is key
- Measure it or it didn't happen
- Be aware of the trade-off between developer time and run time
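To make the overhead point concrete, here is a minimal sketch of the kind of measurement meant above; exact timings will vary by machine:

```r
library(parallel)

# Cheap per-element work: fork/communication overhead dominates,
# and the parallel version may well lose to the serial one.
system.time(lapply(1:1e4, sqrt))
system.time(mclapply(1:1e4, sqrt, mc.cores = 4))

# Expensive per-element work: the overhead is amortized and the
# extra cores have a chance to pay off.
f = function(x) mean(rnorm(1e6))
system.time(lapply(1:20, f))
system.time(mclapply(1:20, f, mc.cores = 4))
```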
An awful lot of statistics is at its core linear algebra.
For example, least squares regression:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
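As a concrete illustration (example data hypothetical), computing \(\hat{\beta}\) directly in R and checking it against R's QR-based fit:

```r
set.seed(1)
X = cbind(1, matrix(rnorm(100 * 2), ncol = 2))  # design matrix with intercept
y = rnorm(100)

# (X'X)^{-1} X'y via the normal equations
beta_hat = solve(crossprod(X), crossprod(X, y))

# Should agree with R's QR-based least squares
coef(lm.fit(X, y))
```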
Principal component analysis:

- Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
- Often solved via SVD - let \(X = U\Sigma W^T\), then \(T = U\Sigma\).
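A small sketch of the SVD route (example data hypothetical; \(X\) is assumed column-centered):

```r
set.seed(1)
X = scale(matrix(rnorm(100 * 5), ncol = 5), center = TRUE, scale = FALSE)

s = svd(X)               # X = U Sigma W^T
T1 = X %*% s$v           # T = X W
T2 = s$u %*% diag(s$d)   # T = U Sigma
all.equal(T1, T2)        # TRUE, up to floating point error
```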
These problems are not unique to statistics; they come up across all areas of numerical computing.

Efficiency and stability of numerical algorithms matter.
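One sketch of why stability matters: forming \(X^TX\) roughly squares the condition number, which is one reason R's `lm` works from a QR decomposition of \(X\) rather than the normal equations. (Example data hypothetical.)

```r
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 1e-6)  # nearly collinear columns
X = cbind(x1, x2)

kappa(X)             # condition number estimate of X
kappa(crossprod(X))  # roughly kappa(X)^2 - much worse conditioned
```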
Low-level algorithms for common linear algebra operations:

BLAS - Basic Linear Algebra Subprograms

- Copying, scaling, and multiplying vectors and matrices
- Origins go back to 1979, written in Fortran

LAPACK - Linear Algebra Package

- Higher-level functionality building on BLAS
- Linear solvers, eigenvalues, and matrix decompositions
- Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
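Much of this is already under your fingertips in R; a sketch of everyday base functions that, in a stock R build, are thin wrappers over LAPACK routines:

```r
A = crossprod(matrix(rnorm(9), 3, 3))  # a symmetric positive definite matrix
b = rnorm(3)

solve(A, b)  # linear solver (LAPACK dgesv)
eigen(A)     # eigenvalues and eigenvectors
svd(A)       # singular value decomposition
chol(A)      # Cholesky factorization (LAPACK dpotrf)
```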
```shell
cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R
    linux-vdso.so.1 =>  (0x00007fff922a3000)
    libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
    libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
    libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
    libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
    libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
    libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
    libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
    libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
    libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)
```
Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated:

- Designed for a single-core CPU
- Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
Multithreaded alternatives (see the sketch below for controlling their thread counts at runtime):

- ATLAS - Automatically Tuned Linear Algebra Software
- OpenBLAS - fork of GotoBLAS from TACC at UTexas
- Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
- cuBLAS / Magma - hybrid CPU / GPU library from the ICL at UTK
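The benchmarks later in this section use an `OpenBlasThreads` package to set the thread count; if you don't have that available, the CRAN package RhpcBLASctl offers similar runtime control. A sketch, assuming RhpcBLASctl is installed:

```r
library(RhpcBLASctl)

blas_get_num_procs()     # threads the loaded BLAS is currently using
blas_set_num_threads(4)  # ask the BLAS for 4 threads
```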
BLAS and LAPACK subroutines are named using the form `pmmaaa` where:

- `p` is a one letter code for the type of data:
    - `S` - single precision floating point
    - `D` - double precision floating point
    - `C` - complex single precision floating point
    - `Z` - complex double precision floating point
- `mm` is a two letter code for the type of matrix expected by the subroutine
- `aaa` is a one to three letter code denoting the algorithm implemented by the subroutine
Example - DGEMM: `D` - type double, `GE` - general matrix, `MM` - matrix / matrix multiplication.
```fortran
dgemm( character TRANSA,
       character TRANSB,
       integer M,
       integer N,
       integer K,
       double precision ALPHA,
       double precision, dimension(lda,*) A,
       integer LDA,
       double precision, dimension(ldb,*) B,
       integer LDB,
       double precision BETA,
       double precision, dimension(ldc,*) C,
       integer LDC )
```
`DGEMM` performs one of the matrix-matrix operations
\[C = \alpha op( A ) \times op( B ) + \beta C\]
where \(op( X )\) is either \(op( X ) = X\) or \(op( X ) = X^T\), \(\alpha\) and \(\beta\) are scalars, and \(A\), \(B\) and \(C\) are matrices, with \(op( A )\) an \(m\) by \(k\) matrix, \(op( B )\) a \(k\) by \(n\) matrix and \(C\) an \(m\) by \(n\) matrix.
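In R terms, `%*%` on real matrices ultimately calls `dgemm`; the following sketch (hypothetical data) spells out the operation above:

```r
alpha = 2; beta = 0.5
A = matrix(rnorm(6), 2, 3)   # op(A): m x k
B = matrix(rnorm(12), 3, 4)  # op(B): k x n
C = matrix(rnorm(8), 2, 4)   # C: m x n

# C <- alpha * op(A) op(B) + beta * C, with op(X) = X here
alpha * A %*% B + beta * C
```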
DPOTRF: `D` - type double, `PO` - positive definite matrix, `TRF` - triangular factorization
```fortran
dpotrf( character UPLO,
        integer N,
        double precision, dimension( lda, * ) A,
        integer LDA,
        integer INFO )
```
`DPOTRF` computes the Cholesky factorization of a real symmetric positive definite matrix \(A\). The factorization has the form \[A = U^T U, \text{ if UPLO = 'U', or}\] \[A = L L^T, \text{ if UPLO = 'L',}\] where \(U\) is an upper triangular matrix and \(L\) is lower triangular.
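R's `chol` uses `dpotrf` with UPLO = 'U', so the returned factor satisfies \(A = U^TU\); a quick sketch (hypothetical data):

```r
A = crossprod(matrix(rnorm(16), 4, 4)) + diag(4)  # positive definite
U = chol(A)                                       # upper triangular factor

all.equal(crossprod(U), A)  # TRUE: t(U) %*% U reconstructs A
```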
```r
library(OpenBlasThreads)

x = matrix(runif(5000^2), ncol = 5000)
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(y %*% y)[3]
  })
})
```
Elapsed time (seconds) for `y %*% y`:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  |  0.001 |   0.001 |   0.000 |   0.000 |
| 500  |  0.018 |   0.011 |   0.008 |   0.008 |
| 1000 |  0.128 |   0.068 |   0.041 |   0.036 |
| 2000 |  0.930 |   0.491 |   0.276 |   0.162 |
| 3000 |  3.112 |   1.604 |   0.897 |   0.489 |
| 4000 |  7.330 |   3.732 |   1.973 |   1.188 |
| 5000 | 14.223 |   7.341 |   3.856 |   2.310 |
Base BLAS takes ~91 secs for a 5000 x 5000 matrix.
```r
library(OpenBlasThreads)
library(clusterGeneration)

x = genPositiveDefMat(5000)$Sigma
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(chol(y))[3]
  })
})
```
Elapsed time (seconds) for `chol(y)`:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  |  0.000 |   0.001 |   0.000 |   0.000 |
| 500  |  0.006 |   0.004 |   0.003 |   0.002 |
| 1000 |  0.031 |   0.023 |   0.013 |   0.010 |
| 2000 |  0.211 |   0.135 |   0.074 |   0.049 |
| 3000 |  0.666 |   0.401 |   0.212 |   0.157 |
| 4000 |  1.530 |   0.903 |   0.460 |   0.283 |
| 5000 |  2.593 |   1.749 |   0.899 |   0.565 |
Base BLAS takes ~26 secs for a 5000 x 5000 matrix.