---
title: Parallelization
author: "Colin Rundel"
date: "2018-03-06"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)

knitr::opts_chunk$set(
  fig.align = "center"
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

```{r setup, echo=FALSE}
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
library(parallel)
library(foreach)
library(doMC)

ggplot2::theme_set(theme_bw())
```

---
class: middle
count: false

# Parallelization

---

## `parallel`

Part of the base packages in R

* tools for forking R processes (some functions do not work on Windows)

* Core functions:
    * `detectCores`
    * `pvec`
    * `mclapply`
    * `mcparallel` & `mccollect`

---

## `detectCores`

Surprisingly, detects the number of cores of the current system.

```{r eval=FALSE}
detectCores()

## [1] 24
```

---

## pvec

Parallelization of a vectorized function call

```{r eval=FALSE}
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372 
```

---

```{r eval=FALSE}
cores = c(1, 2, 4, 8, 16)
order = 6:8

res = map(
  cores,
  function(x) {
    map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
) %>% do.call(rbind, .)

rownames(res) = paste0(cores, " cores")
colnames(res) = paste0("10^", order)

res

##           10^6  10^7  10^8
## 1 cores  0.016 0.282 1.489
## 2 cores  0.070 0.526 5.198
## 4 cores  0.052 0.430 5.023
## 8 cores  0.054 0.376 4.098
## 16 cores 0.073 0.401 4.049
```

---

## mclapply

Parallelized version of `lapply`

```{r eval=FALSE}
system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052 
```

---

```{r eval=FALSE}
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037 
```

---

## mcparallel

Asynchronous evaluation of an R expression in a separate process

```{r}
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6, 1, 1))
o = mcparallel(rgamma(1e6, 1, 1))

str(m)
str(n)
```

---

## mccollect

Checks `mcparallel` objects for completion

```{r}
str(mccollect(list(m, n, o)))
```

---

## mccollect - waiting

```{r}
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
mccollect(p, wait = FALSE)     # will signal the job as terminating
mccollect(p, wait = FALSE)     # there is no longer such a job
```

---
class: middle
count: false

# doMC & foreach

---

## doMC & foreach

Packages by Revolution Analytics that provide the `foreach` function, which is a parallelizable `for` loop (and then some).
* Core functions:
    * `registerDoMC`
    * `foreach`, `%dopar%`, `%do%`

---

## `registerDoMC`

Primarily used to set the number of cores used by `foreach`; by default it uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.

```{r eval=FALSE}
options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4
```

---

## `foreach`

A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combine with `%do%` or `%dopar%` for single- or multi-core execution.

```{r}
for (i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)
```

---

## `foreach` - iterators

`foreach` can iterate across more than one value, but it doesn't do length coercion.

.pull-left[
```{r}
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)
```
]

.pull-right[
```{r}
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)
```
]

---

## `foreach` - combining results

```{r}
foreach(i = 1:5, .combine='c') %do% sqrt(i)
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)
foreach(i = 1:5, .combine='+') %do% sqrt(i)
```

---

## `foreach` - parallelization

Swapping out `%do%` for `%dopar%` will use the registered parallel backend.

```{r}
registerDoMC(4)
```

```{r}
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

---

## Example - Bootstrapping

Bootstrapping is a resampling scheme in which the original data are repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and that resample is used to carry out whatever analysis procedure is of interest. Below we fit a local regression (`loess`) to some synthetic data; we will then construct a bootstrap prediction interval for this model.

--

```{r}
set.seed(3212016)
d = data.frame(x = 1:120) %>%
  mutate(y = sin(2*pi*x/120) + runif(length(x), -1, 1))

l = loess(y ~ x, data = d)
d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l, se = TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
```

---

```{r fig.align="center", fig.width=9, fig.height=6}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```

---

## Example - Cont.

We will now re-implement the code below using one of the parallelization techniques we have just discussed, and then check its performance with 1, 2, and 4 cores (a sketch of one approach follows the plot).

```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

res = map(1:n_rep, function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs = 0.025, na.rm = TRUE),
  bs_high = apply(res, 1, quantile, probs = 0.975, na.rm = TRUE)
)
```

---

```{r eval=FALSE}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_ribbon(aes(ymin = bs_low, ymax = bs_high), fill = "blue", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```

```{r echo=FALSE}
knitr::include_graphics("imgs/bootstrap_loess.png")
```
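---

## Example - Parallel version (sketch)

A minimal sketch of one way to parallelize the bootstrap above, assuming the objects from the previous slides (`d`, `d_xy`) and the loaded `parallel` package; the helper name `fit_one` is only for illustration. Swapping `map()` for `mclapply()` and varying `mc.cores` lets you compare 1, 2, and 4 cores; timings will depend on your machine.

```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

# one bootstrap replicate: resample rows, refit the loess, predict at the original x's
fit_one = function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}

# run the replicates across processes (vary mc.cores = 1, 2, 4 to compare)
res = mclapply(1:n_rep, fit_one, mc.cores = 4) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs = 0.025, na.rm = TRUE),
  bs_high = apply(res, 1, quantile, probs = 0.975, na.rm = TRUE)
)
```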
---

## What to use when?

Optimal use of multiple cores is hard; there isn't one best solution.

* Don't underestimate the overhead cost
* More art than science - experimentation is key
* Measure it or it didn't happen (a small timing sketch follows)
* Be aware of the trade-off between developer time and run time
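---

## Measuring it - a sketch

A minimal, assumed example (not from the original slides) of what "measure it" looks like in practice: time the same toy task serially, with `mclapply`, and with `foreach` + `%dopar%`, then compare. Absolute numbers will vary with the machine and the number of registered cores.

```{r eval=FALSE}
registerDoMC(4)

f = function(i) mean(rnorm(1e6))

system.time( lapply(1:20, f) )                   # serial baseline
system.time( mclapply(1:20, f, mc.cores = 4) )   # parallel::mclapply
system.time( foreach(i = 1:20) %dopar% f(i) )    # foreach with the doMC backend
```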
---
class: middle
count: false

# BLAS and LAPACK

---

## Statistics and Linear Algebra

An awful lot of statistics is, at its core, linear algebra.

For example:

* Linear regression models - find
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$

* Principal component analysis
    * Find $T = XW$ where $W$ is a matrix whose columns are the eigenvectors of $X^T X$.
    * Often solved via the SVD - let $X = U \Sigma W^T$, then $T = U \Sigma$.
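---

## In R - a small sketch

A worked example (assumed, not part of the original slides) making the two formulas concrete with `mtcars`; `solve(crossprod(X), crossprod(X, y))` computes $\hat{\beta}$ as a linear solve rather than forming an explicit inverse, and the PCA scores from the SVD match `prcomp` up to sign.

```{r eval=FALSE}
X = cbind(1, mtcars$wt, mtcars$hp)
y = mtcars$mpg

# beta_hat = (X^T X)^{-1} X^T y, computed as a linear solve
beta_hat = solve(crossprod(X), crossprod(X, y))
cbind(beta_hat, coef(lm(mpg ~ wt + hp, data = mtcars)))

# PCA scores via the SVD: X = U Sigma W^T, so T = U Sigma
Xc = scale(as.matrix(mtcars), center = TRUE, scale = FALSE)
s  = svd(Xc)
head(s$u %*% diag(s$d))   # T = U Sigma
head(prcomp(mtcars)$x)    # same scores (up to sign)
```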
---

## Numerical Linear Algebra

Not unique to statistics - these are the kinds of problems that come up across all areas of numerical computing.

* Numerical linear algebra $\ne$ mathematical linear algebra

* Efficiency and stability of numerical algorithms matter

* Designing and implementing these algorithms is hard
* Don't reinvent the wheel - use the common core linear algebra tools, which have well-defined APIs (a quick illustration follows)
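---

## Why not roll your own?

A quick assumed illustration (not from the original slides): a naive R triple loop for matrix multiplication versus the BLAS-backed `%*%`. Even at this modest size the difference is dramatic; the helper name `naive_mm` is only for the example.

```{r eval=FALSE}
# straightforward triple-loop matrix multiply, entirely in interpreted R
naive_mm = function(A, B) {
  C = matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      for (k in seq_len(ncol(A)))
        C[i, j] = C[i, j] + A[i, k] * B[k, j]
  C
}

A = matrix(runif(200^2), 200, 200)
B = matrix(runif(200^2), 200, 200)

system.time(naive_mm(A, B))   # seconds
system.time(A %*% B)          # milliseconds (handled by the BLAS)
```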
---

## BLAS and LAPACK

Low-level algorithms for common linear algebra operations.

BLAS

* **B**asic **L**inear **A**lgebra **S**ubprograms
* Copying, scaling, and multiplying vectors and matrices
* Origins go back to 1979; written in Fortran
LAPACK

* **L**inear **A**lgebra **Pack**age
* Higher-level functionality building on BLAS
* Linear solvers, eigenvalues, and matrix decompositions
* Origins go back to 1992; mostly Fortran (expanded on LINPACK, EISPACK)
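---

## BLAS / LAPACK from R

Many everyday R operations hand the real work to BLAS or LAPACK. A rough sketch is below; the specific routine names in the comments are typical choices and an implementation detail of your build, not something R's documentation guarantees.

```{r eval=FALSE}
X = matrix(rnorm(1000 * 100), ncol = 100)
y = rnorm(1000)

XtX = crossprod(X)                  # X^T X           -> BLAS (e.g. dsyrk)
Z   = X %*% t(X)                    # matrix product  -> BLAS (e.g. dgemm)
b   = solve(XtX, crossprod(X, y))   # linear solve    -> LAPACK (e.g. dgesv)
R   = chol(XtX)                     # Cholesky        -> LAPACK (e.g. dpotrf)
d   = svd(X)$d                      # singular values -> LAPACK (e.g. dgesdd)
```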
---

## Modern variants?

Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated

* Written in Fortran and designed for a single CPU core
* Certain (potentially non-optimal) hard-coded defaults (e.g. block size)

Multithreaded alternatives:

* ATLAS - Automatically Tuned Linear Algebra Software
* OpenBLAS - fork of GotoBLAS, which came from TACC at UT Austin
* Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
* cuBLAS / Magma - GPU libraries from NVidia and UTK, respectively

---

## OpenBLAS Matrix Multiply (DGEMM) Performance

```{r, eval=FALSE, include=FALSE}
library(RhpcBLASctl)
x = matrix(runif(5000^2), ncol = 5000)

sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(
  cores,
  function(n_cores) {
    blas_set_num_threads(n_cores)
    sapply(
      sizes,
      function(s) {
        y = x[1:s, 1:s]
        system.time(y %*% y)[3]
      }
    )
  }
)
```

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  | 0.001  | 0.001   | 0.000   | 0.000   |
| 500  | 0.018  | 0.011   | 0.008   | 0.008   |
| 1000 | 0.128  | 0.068   | 0.041   | 0.036   |
| 2000 | 0.930  | 0.491   | 0.276   | 0.162   |
| 3000 | 3.112  | 1.604   | 0.897   | 0.489   |
| 4000 | 7.330  | 3.732   | 1.973   | 1.188   |
| 5000 | 14.223 | 7.341   | 3.856   | 2.310   |

---

```{r echo=FALSE, fig.width=8}
d = tribble(
  ~"n",  ~"1 core", ~"2 cores", ~"4 cores", ~"8 cores", ~"K20X",  ~"P100",
   500,     0.018,      0.011,      0.008,      0.008,       NA,       NA,
  1000,     0.128,      0.068,      0.041,      0.036,  0.00598,  0.00072,
  2000,     0.930,      0.491,      0.276,      0.162,  0.01733,  0.00471,
  3000,     3.112,      1.604,      0.897,      0.489,  0.05417,  0.01382,
  4000,     7.330,      3.732,      1.973,      1.188,  0.12554,  0.03243,
  5000,    14.223,      7.341,      3.856,      2.310,  0.24091,  0.06228
)

d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
    geom_point() +
    geom_line()
```

---

```{r echo=FALSE, fig.width=8}
d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
    geom_point() +
    geom_line() +
    scale_y_continuous(trans = 'log2')
```
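---

## Which BLAS / LAPACK am I using?

A short sketch (assumed, not from the original slides) for checking what your R installation is linked against and for controlling the number of BLAS threads; the printed paths and versions will differ between installs.

```{r eval=FALSE}
sessionInfo()   # recent versions of R report the BLAS / LAPACK libraries in use
La_version()    # LAPACK version

library(RhpcBLASctl)
blas_get_num_procs()      # how many threads the BLAS will currently use
blas_set_num_threads(4)   # cap the number of BLAS threads
```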