---
title: Parallelization
author: "Colin Rundel"
date: "2018-03-06"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)

knitr::opts_chunk$set(
  fig.align = "center"
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

```{r setup, echo=FALSE}
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
library(parallel)
library(foreach)
library(doMC)

ggplot2::theme_set(theme_bw())
```

---
class: middle
count: false

# Parallelization

---

## `parallel`

Part of the base packages in R

* tools for forking R processes (some functions do not work on Windows)

* Core functions:
    * `detectCores`
    * `pvec`
    * `mclapply`
    * `mcparallel` & `mccollect`

---

## `detectCores`

Surprisingly, detects the number of cores of the current system.

```{r eval=FALSE}
detectCores()

## [1] 24
```

---

## pvec

Parallelization of a vectorized function call

```{r eval=FALSE}
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372 
```

---

```{r eval=FALSE}
cores = c(1, 2, 4, 8, 16)
order = 6:8

res = map(
  cores,
  function(x) {
    map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores = x))[3])
  }
) %>% do.call(rbind, .)

rownames(res) = paste0(cores, " cores")
colnames(res) = paste0("10^", order)

res

##           10^6  10^7  10^8
## 1 cores  0.016 0.282 1.489
## 2 cores  0.070 0.526 5.198
## 4 cores  0.052 0.430 5.023
## 8 cores  0.054 0.376 4.098
## 16 cores 0.073 0.401 4.049
```

---

## mclapply

Parallelized version of `lapply`

```{r eval=FALSE}
system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052 
```

---

```{r eval=FALSE}
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037 
```

---

## mcparallel

Asynchronous evaluation of an R expression in a separate process

```{r}
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6, 1, 1))
o = mcparallel(rgamma(1e6, 1, 1))

str(m)
str(n)
```

---

## mccollect

Checks `mcparallel` objects for completion

```{r}
str(mccollect(list(m, n, o)))
```

---

## mccollect - waiting

```{r}
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
mccollect(p, wait = FALSE)     # will signal the job as terminating
mccollect(p, wait = FALSE)     # there is no longer such a job
```

---
class: middle
count: false

# doMC & foreach

---

## doMC & foreach

Packages by Revolution Analytics that provide the `foreach` function, which is a parallelizable `for` loop (and then some).
* Core functions:
    * `registerDoMC`
    * `foreach`, `%dopar%`, `%do%`

---

## `registerDoMC`

Primarily used to set the number of cores used by `foreach`; by default it uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.

```{r eval=FALSE}
options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4
```

---

## `foreach`

A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combine with `%do%` or `%dopar%` for single- or multi-core execution.

```{r}
for (i in 1:10) sqrt(i)

foreach(i = 1:5) %do% sqrt(i)
```

---

## `foreach` - iterators

`foreach` can iterate across more than one value, but it doesn't do length coercion.

.pull-left[
```{r}
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)
```
]

.pull-right[
```{r}
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)
```
]

---

## `foreach` - combining results

```{r}
foreach(i = 1:5, .combine='c') %do% sqrt(i)
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)
foreach(i = 1:5, .combine='+') %do% sqrt(i)
```

---

## `foreach` - parallelization

Swapping out `%do%` for `%dopar%` will use the registered parallel backend.

```{r}
registerDoMC(4)
```

```{r}
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))

registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

---

## Example - Bootstrapping

Bootstrapping is a resampling scheme in which the original data are repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and that resample is used to carry out whatever analysis procedure is of interest. Below we fit a local regression (`loess`) to some synthetic data; we will then construct a bootstrap prediction interval for this model.

--

```{r}
set.seed(3212016)
d = data.frame(x = 1:120) %>%
  mutate(y = sin(2*pi*x/120) + runif(length(x), -1, 1))

l = loess(y ~ x, data = d)
d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l, se = TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
```

---

```{r fig.align="center", fig.width=9, fig.height=6}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```

---

## Example - Cont.

We will now re-implement the code below using one of the parallelization techniques we have just discussed, and then check its performance with 1, 2, and 4 cores (a sketch of one approach follows the plot).

```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

res = map(1:n_rep, function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs = 0.025, na.rm = TRUE),
  bs_high = apply(res, 1, quantile, probs = 0.975, na.rm = TRUE)
)
```

---

```{r eval=FALSE}
ggplot(d, aes(x, y)) +
  geom_point(color = "gray50") +
  geom_ribbon(aes(ymin = pred_low, ymax = pred_high), fill = "red", alpha = 0.25) +
  geom_ribbon(aes(ymin = bs_low, ymax = bs_high), fill = "blue", alpha = 0.25) +
  geom_line(aes(y = pred_y)) +
  theme_bw()
```

```{r echo=FALSE}
knitr::include_graphics("imgs/bootstrap_loess.png")
```
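---

## Example - Parallel version (sketch)

A minimal sketch of one way to parallelize the bootstrap above, assuming the objects from the previous slides (`d`, `d_xy`) and the loaded `parallel` package; the helper name `fit_one` is only for illustration. Swapping `map()` for `mclapply()` and varying `mc.cores` lets you compare 1, 2, and 4 cores; timings will depend on your machine.

```{r eval=FALSE}
n_rep = 5000
d_xy = select(d, x, y)

# one bootstrap replicate: resample rows, refit the loess, predict at the original x's
fit_one = function(i) {
  d_xy %>%
    sample_n(nrow(d), replace = TRUE) %>%
    loess(y ~ x, data = .) %>%
    predict(newdata = d) %>%
    setNames(NULL)
}

# run the replicates across processes (vary mc.cores = 1, 2, 4 to compare)
res = mclapply(1:n_rep, fit_one, mc.cores = 4) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low  = apply(res, 1, quantile, probs = 0.025, na.rm = TRUE),
  bs_high = apply(res, 1, quantile, probs = 0.975, na.rm = TRUE)
)
```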
---

## What to use when?

Optimal use of multiple cores is hard; there isn't one best solution.

* Don't underestimate the overhead cost
* More art than science - experimentation is key
* Measure it or it didn't happen (a small timing sketch follows)
* Be aware of the trade-off between developer time and run time
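---

## Measuring it - a sketch

A minimal, assumed example (not from the original slides) of what "measure it" looks like in practice: time the same toy task serially, with `mclapply`, and with `foreach` + `%dopar%`, then compare. Absolute numbers will vary with the machine and the number of registered cores.

```{r eval=FALSE}
registerDoMC(4)

f = function(i) mean(rnorm(1e6))

system.time( lapply(1:20, f) )                   # serial baseline
system.time( mclapply(1:20, f, mc.cores = 4) )   # parallel::mclapply
system.time( foreach(i = 1:20) %dopar% f(i) )    # foreach with the doMC backend
```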
---
class: middle
count: false

# BLAS and LAPACK

---

## Statistics and Linear Algebra

An awful lot of statistics is, at its core, linear algebra.

For example:

* Linear regression models - find
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$

* Principal component analysis
    * Find $T = XW$ where $W$ is a matrix whose columns are the eigenvectors of $X^T X$.
    * Often solved via the SVD - let $X = U \Sigma W^T$, then $T = U \Sigma$.
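---

## In R - a small sketch

A worked example (assumed, not part of the original slides) making the two formulas concrete with `mtcars`; `solve(crossprod(X), crossprod(X, y))` computes $\hat{\beta}$ as a linear solve rather than forming an explicit inverse, and the PCA scores from the SVD match `prcomp` up to sign.

```{r eval=FALSE}
X = cbind(1, mtcars$wt, mtcars$hp)
y = mtcars$mpg

# beta_hat = (X^T X)^{-1} X^T y, computed as a linear solve
beta_hat = solve(crossprod(X), crossprod(X, y))
cbind(beta_hat, coef(lm(mpg ~ wt + hp, data = mtcars)))

# PCA scores via the SVD: X = U Sigma W^T, so T = U Sigma
Xc = scale(as.matrix(mtcars), center = TRUE, scale = FALSE)
s  = svd(Xc)
head(s$u %*% diag(s$d))   # T = U Sigma
head(prcomp(mtcars)$x)    # same scores (up to sign)
```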
---

## Numerical Linear Algebra

Not unique to statistics - these are the kinds of problems that come up across all areas of numerical computing.

* Numerical linear algebra $\ne$ mathematical linear algebra

* Efficiency and stability of numerical algorithms matter

* Designing and implementing these algorithms is hard
* Don't reinvent the wheel - use the common core linear algebra tools, which have well-defined APIs (a quick illustration follows)
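---

## Why not roll your own?

A quick assumed illustration (not from the original slides): a naive R triple loop for matrix multiplication versus the BLAS-backed `%*%`. Even at this modest size the difference is dramatic; the helper name `naive_mm` is only for the example.

```{r eval=FALSE}
# straightforward triple-loop matrix multiply, entirely in interpreted R
naive_mm = function(A, B) {
  C = matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      for (k in seq_len(ncol(A)))
        C[i, j] = C[i, j] + A[i, k] * B[k, j]
  C
}

A = matrix(runif(200^2), 200, 200)
B = matrix(runif(200^2), 200, 200)

system.time(naive_mm(A, B))   # seconds
system.time(A %*% B)          # milliseconds (handled by the BLAS)
```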
---

## BLAS and LAPACK

Low-level algorithms for common linear algebra operations.

BLAS

* **B**asic **L**inear **A**lgebra **S**ubprograms
* Copying, scaling, and multiplying vectors and matrices
* Origins go back to 1979; written in Fortran
LAPACK

* **L**inear **A**lgebra **Pack**age
* Higher-level functionality building on BLAS
* Linear solvers, eigenvalues, and matrix decompositions
* Origins go back to 1992; mostly Fortran (expanded on LINPACK, EISPACK)
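---

## BLAS / LAPACK from R

Many everyday R operations hand the real work to BLAS or LAPACK. A rough sketch is below; the specific routine names in the comments are typical choices and an implementation detail of your build, not something R's documentation guarantees.

```{r eval=FALSE}
X = matrix(rnorm(1000 * 100), ncol = 100)
y = rnorm(1000)

XtX = crossprod(X)                  # X^T X           -> BLAS (e.g. dsyrk)
Z   = X %*% t(X)                    # matrix product  -> BLAS (e.g. dgemm)
b   = solve(XtX, crossprod(X, y))   # linear solve    -> LAPACK (e.g. dgesv)
R   = chol(XtX)                     # Cholesky        -> LAPACK (e.g. dpotrf)
d   = svd(X)$d                      # singular values -> LAPACK (e.g. dgesdd)
```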
---

## Modern variants?

Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated

* Written in Fortran and designed for a single CPU core
* Certain (potentially non-optimal) hard-coded defaults (e.g. block size)

Multithreaded alternatives:

* ATLAS - Automatically Tuned Linear Algebra Software
* OpenBLAS - fork of GotoBLAS, which came from TACC at UT Austin
* Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
* cuBLAS / Magma - GPU libraries from NVidia and UTK, respectively

---

## OpenBLAS Matrix Multiply (DGEMM) Performance

```{r, eval=FALSE, include=FALSE}
library(RhpcBLASctl)
x = matrix(runif(5000^2), ncol = 5000)

sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(
  cores,
  function(n_cores) {
    blas_set_num_threads(n_cores)
    sapply(
      sizes,
      function(s) {
        y = x[1:s, 1:s]
        system.time(y %*% y)[3]
      }
    )
  }
)
```

| n    | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  | 0.001  | 0.001   | 0.000   | 0.000   |
| 500  | 0.018  | 0.011   | 0.008   | 0.008   |
| 1000 | 0.128  | 0.068   | 0.041   | 0.036   |
| 2000 | 0.930  | 0.491   | 0.276   | 0.162   |
| 3000 | 3.112  | 1.604   | 0.897   | 0.489   |
| 4000 | 7.330  | 3.732   | 1.973   | 1.188   |
| 5000 | 14.223 | 7.341   | 3.856   | 2.310   |

---

```{r echo=FALSE, fig.width=8}
d = tribble(
  ~"n",  ~"1 core", ~"2 cores", ~"4 cores", ~"8 cores", ~"K20X",  ~"P100",
   500,     0.018,      0.011,      0.008,      0.008,       NA,       NA,
  1000,     0.128,      0.068,      0.041,      0.036,  0.00598,  0.00072,
  2000,     0.930,      0.491,      0.276,      0.162,  0.01733,  0.00471,
  3000,     3.112,      1.604,      0.897,      0.489,  0.05417,  0.01382,
  4000,     7.330,      3.732,      1.973,      1.188,  0.12554,  0.03243,
  5000,    14.223,      7.341,      3.856,      2.310,  0.24091,  0.06228
)

d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
    geom_point() +
    geom_line()
```

---

```{r echo=FALSE, fig.width=8}
d %>%
  gather(cores, time, -n) %>%
  ggplot(aes(x = n, y = time, color = cores)) +
    geom_point() +
    geom_line() +
    scale_y_continuous(trans = 'log2')
```
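---

## Which BLAS / LAPACK am I using?

A short sketch (assumed, not from the original slides) for checking what your R installation is linked against and for controlling the number of BLAS threads; the printed paths and versions will differ between installs.

```{r eval=FALSE}
sessionInfo()   # recent versions of R report the BLAS / LAPACK libraries in use
La_version()    # LAPACK version

library(RhpcBLASctl)
blas_get_num_procs()      # how many threads the BLAS will currently use
blas_set_num_threads(4)   # cap the number of BLAS threads
```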