class: center, middle, inverse, title-slide

# Profiling & Parallelization
### Colin Rundel
### 2019-03-26

---
exclude: true

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```
## Loading required package: iterators
```

```
## 
## Attaching package: 'purrr'
```

```
## The following objects are masked from 'package:foreach':
## 
##     accumulate, when
```

---
class: middle
count: false

# Profiling & Benchmarking

---
## profvis demo

```r
n = 1e6
d = data_frame(
  x1 = rt(n, df = 3),
  x2 = rt(n, df = 3),
  x3 = rt(n, df = 3),
  x4 = rt(n, df = 3),
  x5 = rt(n, df = 3)
) %>%
  mutate(y = -2*x1 - 1*x2 + 0*x3 + 1*x4 + 2*x5 + rnorm(n))

profvis::profvis(lm(y~., data=d))
```

---
## Benchmarking - `bench`

```r
d = data_frame(x = runif(10000, 1, 1000), y = runif(10000, 1, 1000))
```

```
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
```

```r
b = bench::mark(
  d[d$x > 500, ],
  d[which(d$x > 500), ],
  subset(d, x > 500),
  filter(d, x > 500)
)
b
```

```
## # A tibble: 4 x 10
##   expression   min  mean median      max `itr/sec` mem_alloc  n_gc n_itr
##   <chr>      <bch> <bch> <bch:> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
## 1 d[d$x > 5… 161µs 279µs  264µs 784.23µs     3582.     294KB     6  1565
## 2 d[which(d… 149µs 229µs  221µs 659.76µs     4368.     175KB     6  1821
## 3 subset(d,… 314µs 484µs  451µs    1.8ms     2067.     387KB     6   889
## 4 filter(d,… 198µs 341µs  282µs   4.21ms     2929.     255KB     4  1327
## # … with 1 more variable: total_time <bch:tm>
```

---
## `bench` - relative results

```r
summary(b, relative=TRUE)
```

```
## # A tibble: 4 x 10
##   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr total_time
##   <chr>      <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 d[d$x > 5…  1.08  1.22   1.19  1.19      1.73      1.69   1.5  1.76       1.05
## 2 d[which(d…  1     1      1     1         2.11      1      1.5  2.05       1   
## 3 subset(d,…  2.10  2.11   2.04  2.73      1         2.21   1.5  1          1.03
## 4 filter(d,…  1.33  1.49   1.27  6.39      1.42      1.46   1    1.49       1.09
```

---
class: middle
count: false

# Parallelization

---
## `parallel`

Part of the base packages in R

* tools for the forking of R processes (some functions do not work on Windows)

* Core functions:
    * `detectCores`
    * `pvec`
    * `mclapply`
    * `mcparallel` & `mccollect`

---
## `detectCores`

Surprisingly, detects the number of cores of the current system.

```r
detectCores()

## [1] 24
```

---
## pvec

Parallelization of a vectorized function call

```r
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372
```

---
## pvec - `bench::system_time`

```r
bench::system_time(pvec(1:1e7, sqrt, mc.cores = 1))

## process    real 
##   180ms   180ms

bench::system_time(pvec(1:1e7, sqrt, mc.cores = 4))

## process    real 
##   935ms   980ms

bench::system_time(pvec(1:1e7, sqrt, mc.cores = 8))

## process    real 
##   1.01s   1.05s
```

--

```r
bench::system_time(Sys.sleep(.5))

## process     real 
##  1.93ms 500.09ms
```

---

```r
cores = c(1, 2, 4, 8, 16)
order = 6:8

res = map(
  cores,
  function(x) {
    map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
) %>% do.call(rbind, .)

rownames(res) = paste0(cores, " cores")
colnames(res) = paste0("10^", order)
res

##           10^6  10^7  10^8
## 1 cores  0.016 0.282 1.489
## 2 cores  0.070 0.526 5.198
## 4 cores  0.052 0.430 5.023
## 8 cores  0.054 0.376 4.098
## 16 cores 0.073 0.401 4.049
```
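---

The timings above suggest that the cost of forking workers and copying data can easily swamp the work being parallelized when the function itself is cheap. For intuition, below is a rough conceptual sketch of what `pvec` does (this is *not* the actual implementation, and `pvec_sketch` is just an illustrative name):

```r
# Conceptual sketch of pvec (simplified, not the real implementation):
# split the vector into one chunk per core, apply FUN to each chunk in a
# forked worker, then reassemble the pieces. The fork / copy / recombine
# steps are the overhead that dominates for cheap functions like sqrt().
pvec_sketch = function(v, FUN, mc.cores = 2) {
  chunks = split(v, cut(seq_along(v), mc.cores))
  res = parallel::mclapply(chunks, FUN, mc.cores = mc.cores)
  unlist(res, use.names = FALSE)
}

all.equal(pvec_sketch(1:1e5, sqrt, mc.cores = 2), sqrt(1:1e5))
```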
---
## mclapply

Parallelized version of `lapply`

```r
system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052
```

---

```r
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037
```

---
## mcparallel

Asynchronous evaluation of an R expression in a separate process

```r
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)
```

```
## List of 2
##  $ pid: int 11994
##  $ fd : int [1:2] 4 7
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```

```r
str(n)
```

```
## List of 2
##  $ pid: int 11995
##  $ fd : int [1:2] 5 9
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```

---
## mccollect

Checks `mcparallel` objects for completion and retrieves their results

```r
str(mccollect(list(m,n,o)))
```

```
## List of 3
##  $ 11994: num [1:1000000] -0.848 0.666 1.931 -0.173 0.538 ...
##  $ 11995: num [1:1000000] 0.558 0.975 0.16 0.896 0.29 ...
##  $ 11996: num [1:1000000] 2.135 0.264 0.123 0.218 0.456 ...
```

---
## mccollect - waiting

```r
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
```

```
## $`11997`
## [1] -0.001857127
```

```r
mccollect(p, wait = FALSE)     # will signal the job as terminating
```

```
## $`11997`
## NULL
```

```r
mccollect(p, wait = FALSE)     # there is no longer such a job
```

```
## Warning in selectChildren(jobs, timeout): cannot wait for child 11997 as it does
## not exist
```

```
## NULL
```
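---

The typical pattern with these two functions is fork/join: start a slow computation in a child process, keep working in the parent, and only block when the result is actually needed. A minimal sketch (the expressions here are purely illustrative):

```r
# Fork: start an expensive simulation in a child process right away
job = mcparallel(mean(replicate(1000, max(rnorm(1e4)))))

# ...the parent process is free to do other work while the child runs...
other_work = sum(sqrt(1:1e6))

# Join: mccollect() with the default wait = TRUE blocks until the job
# finishes and returns a list named by the child's pid
res = mccollect(job)[[1]]
```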
---
class: middle
count: false

# doMC & foreach

---
## doMC & foreach

Packages from Revolution Analytics that provide the `foreach` function - a parallelizable `for` loop (and then some).

* Core functions:
    * `registerDoMC`
    * `foreach`, `%dopar%`, `%do%`

---
## `registerDoMC`

Primarily used to set the number of cores used by `foreach`; by default it uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.

```r
options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4
```

---
## `foreach`

A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combined with `%do%` or `%dopar%` for single or multicore execution.

```r
for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
```

---
## `foreach` - iterators

`foreach` can iterate across more than one value, but it doesn't do length coercion

.pull-left[
```r
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)
```

```
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 4.242641
## 
## [[4]]
## [1] 5.656854
## 
## [[5]]
## [1] 7.071068
```
]

.pull-right[
```r
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)
```

```
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
```
]

---
## `foreach` - combining results

```r
foreach(i = 1:5, .combine='c') %do% sqrt(i)
```

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
```

```r
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)
```

```
##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068
```

```r
foreach(i = 1:5, .combine='+') %do% sqrt(i)
```

```
## [1] 8.382332
```

---
## `foreach` - parallelization

Swapping out `%do%` for `%dopar%` will use the parallel backend.

```r
registerDoMC(4)
```

```r
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##   user  system elapsed 
##  0.410   0.033   0.349
```

```r
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##   user  system elapsed 
##  0.982   0.073   0.347
```

```r
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##   user  system elapsed 
##  1.017   0.090   0.315
```

---
## Example - Bootstrapping

Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and that resample is used to conduct whatever analysis procedure is of interest.

Below is an example of fitting a local regression (`loess`) to some synthetic data; we will then construct a bootstrap prediction interval for this model.

--

```r
set.seed(3212016)
d = data.frame(x = 1:120) %>%
    mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))

l = loess(y ~ x, data=d)
d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l, se=TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
```

---

```r
ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()
```

<img src="Lec15_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
## Example - Cont.

We will now re-implement the sequential code below using one of the parallelization techniques we have just discussed and then check the performance with 1, 2, and 4 cores (a sketch of one possible parallel version follows on the next slide).

```r
n_rep = 5000
d_xy = select(d, x, y)

res = map(
  1:n_rep,
  function(i) {
    d_xy %>%
      sample_n(n(), replace=TRUE) %>%
      loess(y ~ x, data=.) %>%
      predict(newdata=d) %>%
      setNames(NULL)
  }
) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs=c(0.025), na.rm=TRUE),
  bs_high = apply(res, 1, quantile, probs=c(0.975), na.rm=TRUE)
)
```
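---

Since each bootstrap replicate is independent, the loop body can stay essentially the same - only the iteration mechanism changes. Below is a sketch of one possible parallel version using `foreach` with the doMC backend from earlier (the core count is illustrative, and `nrow(d_xy)` is used for the resample size):

```r
# Sketch of a parallel bootstrap using foreach + %dopar% (doMC backend).
# Each replicate is independent, so the body matches the sequential version;
# .combine = cbind assembles the per-replicate predictions into a matrix.
registerDoMC(4)

res = foreach(i = 1:n_rep, .combine = cbind) %dopar% {
  d_xy %>%
    sample_n(nrow(d_xy), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}
```

Because doMC forks the current session, `d`, `d_xy`, and `n_rep` are visible to the workers without any explicit export; re-running with `registerDoMC(1)`, `registerDoMC(2)`, and `registerDoMC(4)` gives the 1, 2, and 4 core comparison.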
---

```r
ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_ribbon(aes(ymin=bs_low, ymax=bs_high), fill="blue", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()
```

<img src="imgs/bootstrap_loess.png" width="1343" style="display: block; margin: auto;" />

---
## furrr / future

```r
system.time( purrr::map(c(2,2,2), Sys.sleep) )

##   user  system elapsed 
##  0.032   0.024   6.004

system.time( furrr::future_map(c(2,2,2), Sys.sleep) )

##   user  system elapsed 
##  0.110   0.028   6.066

future::plan(future::multiprocess)
system.time( furrr::future_map(c(2,2,2), Sys.sleep) )

##   user  system elapsed 
##  0.091   0.114   2.212
```

---
## What to use when?

Optimal use of multiple cores is hard - there isn't one best solution

* Don't underestimate the overhead cost
* More art than science - experimentation is key
* Measure it or it didn't happen
* Be aware of the trade-off between developer time and run time

---
class: middle
count: false

# BLAS and LAPACK

---
## Statistics and Linear Algebra

An awful lot of statistics is, at its core, linear algebra.

<br/>

For example:

* Linear regression models - find $$ \hat{\beta} = (X^T X)^{-1} X^T y $$
* Principal component analysis
    * Find `\(T = XW\)` where `\(W\)` is a matrix whose columns are the eigenvectors of `\(X^TX\)`.
    * Often solved via SVD - let `\(X = U\Sigma W^T\)`, then `\(T = U\Sigma\)`.

---
## Numerical Linear Algebra

Not unique to statistics - these are the types of problems that come up across all areas of numerical computing.

* Numerical linear algebra `\(\ne\)` mathematical linear algebra

<br/>

* Efficiency and stability of numerical algorithms matter
    * Designing and implementing these algorithms is hard

<br/>

* Don't reinvent the wheel - use common core linear algebra tools (well defined APIs)

---
## BLAS and LAPACK

Low level algorithms for common linear algebra operations

<br/>

BLAS

* **B**asic **L**inear **A**lgebra **S**ubprograms
* Copying, scaling, and multiplying vectors and matrices
* Origins go back to 1979, written in Fortran

LAPACK

* **L**inear **A**lgebra **Pack**age
* Higher level functionality building on BLAS
* Linear solvers, eigenvalues, and matrix decompositions
* Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

---
## Modern variants?

Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated

* Written in Fortran and designed for a single CPU core
* Certain (potentially non-optimal) hard coded defaults (e.g. block size)

Multithreaded alternatives:

* ATLAS - Automatically Tuned Linear Algebra Software
* OpenBLAS - fork of GotoBLAS from TACC at UTexas
* Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
* cuBLAS / Magma - GPU libraries from NVIDIA and UTK respectively
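---

To see which BLAS/LAPACK libraries your R installation is actually using, and to get a rough feel for timings like those on the next slide, something along these lines works (the matrix size is illustrative; for OpenBLAS the number of threads is typically controlled via the `OPENBLAS_NUM_THREADS` environment variable):

```r
# Recent versions of R (>= 3.4) report the BLAS and LAPACK libraries in use
# near the top of sessionInfo()'s output
sessionInfo()

# Time a double precision matrix multiply - %*% on numeric matrices is
# handed off to the BLAS dgemm routine
n = 2000
A = matrix(rnorm(n * n), n, n)
B = matrix(rnorm(n * n), n, n)
system.time(A %*% B)
```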
---
## OpenBLAS Matrix Multiply (DGEMM) Performance

Elapsed time (in seconds) to multiply two `\(n \times n\)` matrices:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  | 0.001  | 0.001   | 0.000   | 0.000   |
| 500  | 0.018  | 0.011   | 0.008   | 0.008   |
| 1000 | 0.128  | 0.068   | 0.041   | 0.036   |
| 2000 | 0.930  | 0.491   | 0.276   | 0.162   |
| 3000 | 3.112  | 1.604   | 0.897   | 0.489   |
| 4000 | 7.330  | 3.732   | 1.973   | 1.188   |
| 5000 | 14.223 | 7.341   | 3.856   | 2.310   |

---

<img src="Lec15_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

<img src="Lec15_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />