class: center, middle, inverse, title-slide

# Parallelization
## Statistical Computing & Programming
### Shawn Santo
### 06-17-20

---

## Supplementary materials

Companion videos

- [Benchmarking and timing code](https://warpwire.duke.edu/w/Z9gDAA/)
- [Functions `pvec()` and `mclapply()`](https://warpwire.duke.edu/w/adgDAA/)
- [Functions `mcparallel()` and `mccollect()`](https://warpwire.duke.edu/w/cdgDAA/)

Additional resources

- [Multicore Data Science with R and Python](https://blog.dominodatalab.com/multicore-data-science-r-python/)
- Beyond Single-Core R [slides](https://ljdursi.github.io/beyond-single-core-R/#/) by Jonathan Dursi
- Getting started with `doMC` and `foreach` [vignette](https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf) by Steve Weston

---

class: inverse, center, middle

# Timing code

---

## Benchmarking with package `bench`

.tiny[
```r
library(bench)
```
]

.tiny[
```r
x <- runif(n = 1000000)
b <- bench::mark(
  sqrt(x),
  x ^ 0.5,
  x ^ (1 / 2),
  exp(log(x) / 2),
  time_unit = 's'
)
b
```

```
#> # A tibble: 4 x 6
#>   expression        min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <dbl>   <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 sqrt(x)       0.00196 0.00226     425.     7.63MB   217.
#> 2 x^0.5         0.0185  0.0191       51.1    7.63MB     9.73
#> 3 x^(1/2)       0.0180  0.0186       53.5    7.63MB     9.73
#> 4 exp(log(x)/2) 0.0137  0.0143       70.0    7.63MB    15.0
```
]

- If `time_unit` is one of 'ns', 'us', 'ms', 's', 'm', 'h', 'd', or 'w', times are expressed in nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days, or weeks, respectively.
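For instance, the same comparison can be reported in milliseconds just by changing `time_unit`; a minimal sketch, assuming the `bench` package is installed:

```r
library(bench)

x <- runif(n = 1000000)

# Same benchmark as above, but min and median are reported in ms
b_ms <- bench::mark(
  sqrt(x),
  exp(log(x) / 2),
  time_unit = 'ms'
)

# min and median are plain numeric columns measured in milliseconds
b_ms[, c("expression", "min", "median")]
```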
---

## Relative performance

```r
class(b)
```

```
#> [1] "bench_mark" "tbl_df"     "tbl"        "data.frame"
```

```r
summary(b, relative = TRUE)
```

```
#> # A tibble: 4 x 6
#>   expression      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 sqrt(x)        1      1         8.33         1    22.3
#> 2 x^0.5          9.46   8.45      1            1     1.00
#> 3 x^(1/2)        9.19   8.20      1.05         1     1
#> 4 exp(log(x)/2)  6.99   6.31      1.37         1     1.54
```

---

## CPU and real time

.tiny[
```r
system.time({
  x <- c()
  for (i in 1:100000) {
    x <- c(x, rnorm(1) + 5)
  }
})
```

```
#>    user  system elapsed 
#>  17.491   8.274  25.786
```
]

--

.tiny[
```r
system.time({
  x <- numeric(length = 100000)
  for (i in 1:100000) {
    x[i] <- rnorm(1) + 5
  }
})
```

```
#>    user  system elapsed 
#>   0.169   0.048   0.218
```
]

--

.tiny[
```r
system.time({
  rnorm(100000) + 5
})
```

```
#>    user  system elapsed 
#>   0.006   0.000   0.006
```
]

---

```r
x <- data.frame(matrix(rnorm(100000), nrow = 1))
```

```r
bench_time({
  types <- character(dim(x)[2])
  for (i in seq_along(x)) {
    types[i] <- typeof(x[i])
  }
})
```

```
#> process    real 
#>   9.64s   9.65s
```

--

```r
bench_time({
  sapply(x, typeof)
})
```

```
#> process    real 
#>  55.8ms    56ms
```

--

```r
bench_time({
  purrr::map_chr(x, typeof)
})
```

```
#> process    real 
#>   478ms   480ms
```

---

## Exercise

Which function reads in CSV files fastest?

- `read.csv()`
- `readr::read_csv()`
- `data.table::fread()`

Test them with http://bit.ly/nz-data and the 2013 capital bike share data available at http://www2.stat.duke.edu/~sms185/data/bike/cbs_2013.csv.

???
## Solutions

```r
nz_link <- "http://bit.ly/nz-data"

bench_time({
  read.csv(nz_link)
})
```

```
#> process    real 
#> 208.6ms    3.6s
```

```r
bench_time({
  readr::read_csv(nz_link)
})
```

```
#> process    real 
#>   261ms   455ms
```

```r
bench_time({
  data.table::fread(nz_link)
})
```

```
#> process    real 
#>   166ms   867ms
```

```r
bike_link <- "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2013.csv"

bench_time({
  read.csv(bike_link)
})
```

```
#> process    real 
#>  45.06s   1.05m
```

```r
bench_time({
  readr::read_csv(bike_link)
})
```

```
#> process    real 
#>   16.4s   41.6s
```

```r
bench_time({
  data.table::fread(bike_link)
})
```

```
#> process    real 
#>   9.14s  23.72s
```

---

class: inverse, center, middle

# Parallelization

---

## Code bounds

Your computations, in R or any other language, are typically bounded by some combination of the following four factors.

1. CPUs
2. Memory
3. Inputs / Outputs
4. Network

--

Today we'll focus on how our computations (in some instances) can be less affected by the first bound.

---

## Terminology

- **CPU**: central processing unit, the primary component of a computer that processes instructions

--

- **Core**: an individual processor within a CPU; more cores allow more instructions to run simultaneously
    - You can get a Duke VM with 2 cores
    - Your laptop probably has 2, 4, or 8 cores
    - The DSS R cluster has 16 cores
    - Duke's computing cluster (DCC) has 15,667 cores

--

- **User CPU time**: the CPU time spent by the current process, in our case, the R session
- **System CPU time**: the CPU time spent by the OS on behalf of the current running process

---

## Run in serial or parallel

Suppose I have `\(n\)` tasks, `\(t_1\)`, `\(t_2\)`, `\(\ldots\)`, `\(t_n\)`, that I want to run.

<br/>

To **run in serial** implies that first task `\(t_1\)` is run and we wait for it to complete. Next, task `\(t_2\)` is run. Upon its completion the next task is run, and so on, until task `\(t_n\)` is complete. If each task takes `\(s\)` seconds to complete, then my theoretical run time is `\(sn\)`.
<br/>

Assume I have `\(n\)` cores. To **run in parallel** means I can divide my `\(n\)` tasks among the `\(n\)` cores. For instance, task `\(t_1\)` runs on core 1, task `\(t_2\)` runs on core 2, etc. Again, if each task takes `\(s\)` seconds to complete and I have `\(n\)` cores, then my theoretical run time is `\(s\)` seconds. In practice, overhead means this ideal is never achieved.

*Here we assume all `\(n\)` tasks are independent.*

---

## Ways to parallelize

1. Sockets <br/><br/>
  A new version of R is launched on each core.
    - Available on all systems
    - Each process on each core is unique
  <br/><br/>
2. Forking <br/><br/>
  The current R session is copied to new cores.
    - Not available on Windows
    - Less overhead and easy to implement

---

## Package `parallel`

This package builds on packages `snow` and `multicore`. It can handle much larger chunks of computation in parallel.

```r
library(parallel)
```

Core functions:

- `detectCores()`
- `pvec()`, based on forking
- `mclapply()`, based on forking
- `mcparallel()`, `mccollect()`, based on forking

Follow along on our DSS R cluster.

---

## How many cores do I have?

On my MacBook Pro

```r
detectCores()
```

`#> [1] 8`

<br/>

--

On pawn, rook, knight

```r
detectCores()
```

`#> [1] 16`

---

## `pvec()`

Using forking, `pvec()` parallelizes the execution of a function on vector elements by splitting the vector and submitting each part to one core.
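Before timing it, a minimal call shows the semantics. This is a sketch; since `pvec()` relies on forking, it requires `mc.cores = 1` on Windows:

```r
library(parallel)

x <- 1:10

# Split x across 2 cores, apply the function to each chunk, and
# reassemble the result in the original order
pvec(x, FUN = function(v) v ^ 2, mc.cores = 2)

# Equivalent serial computation: both give 1, 4, 9, ..., 100
x ^ 2
```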
```r system.time(rnorm(1e7) ^ 4) ``` ``` #> user system elapsed #> 0.742 0.017 0.764 ``` ```r system.time(pvec(v = rnorm(1e7), FUN = `^`, 4, mc.cores = 1)) ``` ``` #> user system elapsed #> 0.735 0.016 0.755 ``` ```r system.time(pvec(v = rnorm(1e7), FUN = `^`, 4, mc.cores = 2)) ``` ``` #> user system elapsed #> 1.344 0.409 1.349 ``` --- ```r system.time(pvec(v = rnorm(1e7), FUN = `^`, 4, mc.cores = 4)) ``` ``` #> user system elapsed #> 1.044 0.226 0.924 ``` ```r system.time(pvec(v = rnorm(1e7), FUN = `^`, 4, mc.cores = 6)) ``` ``` #> user system elapsed #> 1.111 0.217 0.880 ``` ```r system.time(pvec(v = rnorm(1e7), FUN = `^`, 4, mc.cores = 8)) ``` ``` #> user system elapsed #> 1.186 0.273 0.893 ``` --- <img src="lec-21_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> *Don't underestimate the overhead cost!* --- ## `mclapply()` Using forking, `mclapply()` is a parallelized version of `lapply()`. Recall that `lapply()` returns a list, similar to `map()`. ```r system.time(rnorm(1e6)) #> user system elapsed #> 0.101 0.007 0.107 system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2))) #> user system elapsed #> 0.148 0.136 0.106 system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4))) #> user system elapsed #> 0.242 0.061 0.052 ``` --- ```r system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 6))) #> user system elapsed #> 0.097 0.047 0.079 system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8))) #> user system elapsed #> 0.193 0.076 0.040 system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10))) #> user system elapsed #> 0.162 0.083 0.041 system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12))) #> user system elapsed #> 0.098 0.065 0.037 ``` --- ## Another example ```r delayed_rpois <- function() { Sys.sleep(1) rpois(10, lambda = 3) } ``` .tiny[ ```r bench_time(mclapply(1:8, function(x) delayed_rpois(), mc.cores = 1)) ``` ``` #> process 
real #> 6.6ms 8.03s ``` ```r bench_time(mclapply(1:8, function(x) delayed_rpois(), mc.cores = 4)) ``` ``` #> process real #> 8.81ms 2.02s ``` ```r bench_time(mclapply(1:8, function(x) delayed_rpois(), mc.cores = 8)) ``` ``` #> process real #> 16.56ms 1.02s ``` ```r # I don't have 800 cores bench_time(mclapply(1:8, function(x) delayed_rpois(), mc.cores = 800)) ``` ``` #> process real #> 16.34ms 1.02s ``` ] --- ## `mcparallel()` & `mccollect()` Using forking, evaluate an R expression asynchronously in a separate process. .tiny[ ```r x <- list() x$pois <- mcparallel({ Sys.sleep(1) rpois(10, 2) }) x$norm <- mcparallel({ Sys.sleep(2) rnorm(10) }) x$beta <- mcparallel({ Sys.sleep(3) rbeta(10, 1, 1) }) result <- mccollect(x) str(result) ``` ``` #> List of 3 #> $ 13450: int [1:10] 3 5 0 2 3 1 2 0 2 3 #> $ 13451: num [1:10] 0.777 1.097 -0.185 -0.199 0.118 ... #> $ 13452: num [1:10] 0.17 0.807 0.864 0.272 0.284 ... ``` ] --- ```r bench_time({ x <- list() x$pois <- mcparallel({ Sys.sleep(1) rpois(10, 2) }) x$norm <- mcparallel({ Sys.sleep(2) rnorm(10) }) x$beta <- mcparallel({ Sys.sleep(3) rbeta(10, 1, 1) }) result <- mccollect(x) }) ``` ``` #> process real #> 6.03ms 3.01s ``` --- ## A closer look at `mcparallel()` & `mccollect()` ```r str(x) ``` ``` #> List of 3 #> $ pois:List of 2 #> ..$ pid: int 13464 #> ..$ fd : int [1:2] 4 7 #> ..- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process" #> $ norm:List of 2 #> ..$ pid: int 13465 #> ..$ fd : int [1:2] 5 9 #> ..- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process" #> $ beta:List of 2 #> ..$ pid: int 13466 #> ..$ fd : int [1:2] 6 11 #> ..- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process" ``` --- To check some of your results early set `wait = FALSE` and a timeout time in seconds. ```r p <- mcparallel({ Sys.sleep(1) mean(rnorm(100)) }) mccollect(p, wait = FALSE, timeout = 2) ``` ``` #> $`13471` #> [1] 0.1146858 ``` -- However, if you are impatient, you may get a NULL value. 
```r
q <- mcparallel({
  Sys.sleep(1)
  mean(rnorm(100))
})

mccollect(q, wait = FALSE)
```

```
#> NULL
```

```r
mccollect(q)
```

```
#> $`13480`
#> [1] 0.01058231
```

---

class: inverse, center, middle

# Sockets

---

## Using sockets to parallelize

The basic recipe is as follows:

```r
library(parallel)
detectCores()
*c1 <- makeCluster(4)
result <- clusterApply(cl = c1, ...)
stopCluster(c1)
```

<br/>

Here you are spawning new R sessions. Data, packages, functions, etc. need to be shipped to the workers.

<br/>

We'll go into more details on using sockets next lecture.

---

## References

- https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
- https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
- https://ljdursi.github.io/beyond-single-core-R/#/
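---

## Socket recipe in action

A minimal working version of the socket recipe from earlier; the worker count and the exported `lambda` variable are illustrative choices, not fixed requirements:

```r
library(parallel)

# Spawn 2 fresh R sessions that communicate over sockets
cl <- makeCluster(2)

# Unlike forked workers, socket workers start empty: objects
# must be shipped to them explicitly
lambda <- 3
clusterExport(cl, "lambda")

# Each worker simulates one mean of Poisson draws
result <- clusterApply(cl, 1:4, function(i) mean(rpois(1e5, lambda)))

stopCluster(cl)
str(result)
```

`clusterApply()` returns a list with one element per task, here four means, each close to 3.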