---
title: Parallelization
author: "Colin Rundel"
date: "2018-03-06"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true
```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)

knitr::opts_chunk$set(
  fig.align = "center"
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
```{r setup, echo=FALSE}
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
library(parallel)
library(foreach)
library(doMC)
ggplot2::theme_set(theme_bw())
```
---
class: middle
count: false
# Parallelization
---
## `parallel`
Part of the base packages in R
* tools for forking R processes (fork-based functions do not work on Windows)
* Core functions:
* `detectCores`
* `pvec`
* `mclapply`
* `mcparallel` & `mccollect`
---
## `detectCores`
Surprisingly, detects the number of cores of the current system.
```{r eval=FALSE}
detectCores()
## [1] 24
```
---
## `pvec`
Parallelization of a vectorized function call. Note in the timings below that, for an operation as cheap as `sqrt`, the overhead of forking and splitting the vector can outweigh the gain from additional cores.
```{r eval=FALSE}
system.time(pvec(1:1e7, sqrt, mc.cores = 1))
## user system elapsed
## 0.214 0.029 0.243
system.time(pvec(1:1e7, sqrt, mc.cores = 4))
## user system elapsed
## 0.442 0.185 0.631
system.time(pvec(1:1e7, sqrt, mc.cores = 8))
## user system elapsed
## 0.532 0.389 0.372
```
---
```{r eval=FALSE}
cores = c(1,2,4,8,16)
order = 6:8
res = map(
  cores,
  function(x) {
    map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
) %>% do.call(rbind, .)
rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)
res
## 10^6 10^7 10^8
## 1 cores 0.016 0.282 1.489
## 2 cores 0.070 0.526 5.198
## 4 cores 0.052 0.430 5.023
## 8 cores 0.054 0.376 4.098
## 16 cores 0.073 0.401 4.049
```
---
## `mclapply`
Parallelized version of `lapply`
```{r eval=FALSE}
system.time(rnorm(1e6))
## user system elapsed
## 0.101 0.007 0.107
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))
## user system elapsed
## 0.148 0.136 0.106
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.242 0.061 0.052
```
---
```{r eval=FALSE}
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.097 0.047 0.079
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))
## user system elapsed
## 0.193 0.076 0.040
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))
## user system elapsed
## 0.162 0.083 0.041
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))
## user system elapsed
## 0.098 0.065 0.037
```
---
## `mcparallel`
Asynchronous evaluation of an R expression in a separate process
```{r}
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))
str(m)
str(n)
```
---
## `mccollect`
Retrieves results from `mcparallel` jobs, optionally waiting for them to complete
```{r}
str(mccollect(list(m,n,o)))
```
---
## `mccollect` - waiting
```{r}
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
mccollect(p, wait = FALSE) # will signal the job as terminating
mccollect(p, wait = FALSE) # there is no longer such a job
```
---
class: middle
count: false
# doMC & foreach
---
## doMC & foreach
Packages from Revolution Analytics that provide `foreach`, a parallelizable `for` loop (and then some).
* Core functions:
* `registerDoMC`
* `foreach`, `%dopar%`, `%do%`
---
## `registerDoMC`
Primarily used to set the number of cores used by `foreach`. By default it uses `options("cores")` or, if that is unset, half the number of cores found by `detectCores` from the parallel package.
```{r eval=FALSE}
options("cores")
## $cores
## NULL
detectCores()
## [1] 24
getDoParWorkers()
## [1] 1
registerDoMC(4)
getDoParWorkers()
## [1] 4
```
---
## `foreach`
A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combine it with `%do%` or `%dopar%` for serial or parallel execution.
```{r}
for(i in 1:10) sqrt(i)
foreach(i = 1:5) %do%
sqrt(i)
```
---
## `foreach` - iterators
`foreach` can iterate over more than one object at a time, but it does not do length coercion - the shortest argument determines the number of iterations.
.pull-left[
```{r}
foreach(i = 1:5, j = 1:5) %do%
sqrt(i^2+j^2)
```
]
.pull-right[
```{r}
foreach(i = 1:5, j = 1:2) %do%
sqrt(i^2+j^2)
```
]
---
## `foreach` - combining results
```{r}
foreach(i = 1:5, .combine='c') %do%
sqrt(i)
foreach(i = 1:5, .combine='cbind') %do%
sqrt(i)
foreach(i = 1:5, .combine='+') %do%
sqrt(i)
```
---
## `foreach` - parallelization
Swapping out `%do%` for `%dopar%` will use the parallel backend.
```{r}
registerDoMC(4)
```
```{r}
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```
---
## Example - Bootstrapping
Bootstrapping is a resampling scheme in which the original data are repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and that sample is used to carry out whatever analysis procedure is of interest. Below we fit a local regression (`loess`) to some synthetic data; we will then construct a bootstrap prediction interval for this model.
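As a minimal illustration of the resampling step, here is a sketch (`d` stands for any data frame; base `sample` is used here, while the full example later uses `sample_n`):
```{r eval=FALSE}
# One bootstrap resample: draw nrow(d) row indices with replacement
idx = sample(nrow(d), nrow(d), replace = TRUE)
d_boot = d[idx, ]
```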
--
```{r}
set.seed(3212016)
d = data.frame(x = 1:120) %>%
  mutate(y = sin(2*pi*x/120) + runif(length(x), -1, 1))

l = loess(y ~ x, data = d)

d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l, se = TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
```
---
```{r fig.align="center", fig.width=9, fig.height=6}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```
---
## Example - Cont.
We will now re-implement the code below using one of the parallelization techniques we have just discussed, and then check its performance with 1, 2, and 4 cores (one possible approach is sketched after the serial version).
```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

res = map(1:n_rep, function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs = 0.025, na.rm = TRUE),
  bs_high = apply(res, 1, quantile, probs = 0.975, na.rm = TRUE)
)
```
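---

One possible parallel re-implementation, sketched here with `mclapply` (the `mc.cores` values are illustrative assumptions; a `foreach` + `%dopar%` version would work just as well):

```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

# One bootstrap fit: resample, refit the loess, predict on the original grid
fit_one = function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}

# Compare timings as the number of cores increases
system.time(res <- mclapply(1:n_rep, fit_one, mc.cores = 1) %>% do.call(cbind, .))
system.time(res <- mclapply(1:n_rep, fit_one, mc.cores = 2) %>% do.call(cbind, .))
system.time(res <- mclapply(1:n_rep, fit_one, mc.cores = 4) %>% do.call(cbind, .))
```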
---
```{r eval=FALSE}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_ribbon(aes(ymin = bs_low, ymax = bs_high), fill = "blue", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```
```{r echo=FALSE}
knitr::include_graphics("imgs/bootstrap_loess.png")
```
---
## What to use when?
Optimal use of multiple cores is hard; there isn't one best solution
* Don't underestimate the overhead cost
* More art than science - experimentation is key
* Measure it or it didn't happen
* Be aware of the trade-off between developer time and run time
---
class: middle
count: false
# BLAS and LAPACK
---
## Statistics and Linear Algebra
An awful lot of statistics is at its core linear algebra.
For example:
* Linear regression models - find
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$
* Principal component analysis (PCA)
    * Find $T = XW$ where $W$ is a matrix whose columns are the eigenvectors of $X^TX$.
    * Often solved via the SVD - let $X = U\Sigma W^T$, then $T = U\Sigma$ (a quick check follows on the next slide).
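---

As a quick numerical check of both formulas, a sketch using simulated data (`lm` and `prcomp` are used only for comparison; absolute values are compared since SVD column signs are arbitrary):

```{r eval=FALSE}
set.seed(1)
n = 100; p = 5
X = matrix(rnorm(n * p), ncol = p)
y = X %*% rnorm(p) + rnorm(n)

# Regression coefficients via the normal equations vs lm()
beta_hat = solve(t(X) %*% X, t(X) %*% y)
all.equal(c(beta_hat), unname(coef(lm(y ~ X - 1))))

# PCA scores (T = U %*% Sigma) from the SVD of the centered data vs prcomp()
Xc = scale(X, center = TRUE, scale = FALSE)
s = svd(Xc)
T_scores = s$u %*% diag(s$d)
all.equal(abs(unname(T_scores)), abs(unname(prcomp(X)$x)))
```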
---
## Numerical Linear Algebra
Not unique to statistics - these are the types of problems that come up across all areas of numerical computing.
* Numerical linear algebra $\ne$ mathematical linear algebra
* Efficiency and stability of numerical algorithms matter
* Designing and implementing these algorithms is hard
* Don't reinvent the wheel - use the common core linear algebra libraries (well-defined APIs)
---
## BLAS and LAPACK
Low level algorithms for common linear algebra operations
BLAS
* **B**asic **L**inear **A**lgebra **S**ubprograms
* Copying, scaling, multiplying vectors and matrices
* Origins go back to 1979, written in Fortran
LAPACK
* **L**inear **A**lgebra **Pack**age
* Higher level functionality building on BLAS.
* Linear solvers, eigenvalues, and matrix decompositions
* Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
---
## Modern variants?
Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated
* Written in Fortran and designed for a single CPU core
* Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
Multithreaded alternatives:
* ATLAS - Automatically Tuned Linear Algebra Software
* OpenBLAS - fork of GotoBLAS from TACC at UTexas
* Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
* cuBLAS / Magma - GPU libraries from NVidia and UTK respectively
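---

To see which BLAS / LAPACK your R is linked against, and to adjust BLAS threading at runtime, here is a sketch (the library paths are reported by `sessionInfo()` in R >= 3.4; the thread functions come from the `RhpcBLASctl` package, which is also used for the timings that follow):

```{r eval=FALSE}
# Which BLAS / LAPACK shared libraries is this R linked against?
sessionInfo()$BLAS
sessionInfo()$LAPACK

# Query and set the number of threads used by the BLAS
library(RhpcBLASctl)
blas_get_num_procs()
blas_set_num_threads(4)
```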
---
## OpenBLAS Matrix Multiply (DGEMM) Performance
```{r, eval=FALSE, include=FALSE}
library(RhpcBLASctl)

x = matrix(runif(5000^2), ncol = 5000)

sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(
  cores,
  function(n_cores)
  {
    blas_set_num_threads(n_cores)
    sapply(
      sizes,
      function(s)
      {
        y = x[1:s, 1:s]
        system.time(y %*% y)[3]
      }
    )
  }
)
```
Elapsed time (seconds) for multiplying two n x n matrices:

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100 | 0.001 | 0.001 | 0.000 | 0.000 |
| 500 | 0.018 | 0.011 | 0.008 | 0.008 |
| 1000 | 0.128 | 0.068 | 0.041 | 0.036 |
| 2000 | 0.930 | 0.491 | 0.276 | 0.162 |
| 3000 | 3.112 | 1.604 | 0.897 | 0.489 |
| 4000 | 7.330 | 3.732 | 1.973 | 1.188 |
| 5000 | 14.223 | 7.341 | 3.856 | 2.310 |
---
```{r echo=FALSE, fig.width=8}
d = tribble(
  ~"n", ~"1 core", ~"2 cores", ~"4 cores", ~"8 cores", ~"K20X", ~"P100",
   500,  0.018,  0.011,  0.008,  0.008,      NA,      NA,
  1000,  0.128,  0.068,  0.041,  0.036, 0.00598, 0.00072,
  2000,  0.930,  0.491,  0.276,  0.162, 0.01733, 0.00471,
  3000,  3.112,  1.604,  0.897,  0.489, 0.05417, 0.01382,
  4000,  7.330,  3.732,  1.973,  1.188, 0.12554, 0.03243,
  5000, 14.223,  7.341,  3.856,  2.310, 0.24091, 0.06228
)

d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
  geom_point() +
  geom_line()
```
---
```{r echo=FALSE, fig.width=8}
d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(trans = 'log2')
```