Profiling & Parallelization

class: center, middle, inverse, title-slide

# Profiling & Parallelization
### Colin Rundel
### 2018-10-31

---

exclude: true

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```
## Loading required package: iterators
```

---
class: middle
count: false

# Profiling

---

## profvis demo

```r
n=1e6

d = data_frame(
  x1 = rt(n, df = 3),
  x2 = rt(n, df = 3),
  x3 = rt(n, df = 3),
  x4 = rt(n, df = 3),
  x5 = rt(n, df = 3),
) %>%
  mutate(y = -2*x1 - 1*x2 + 0*x3 + 1*x4 + 2*x5 + rnorm(n))

profvis::profvis(lm(y~., data=d))
```

---

## profvis + shiny demo

```r
x=source("ex_2018_10_29.R")
profvis(print(x))
```

---

## Benchmarking - `bench`

```r
d = data_frame(x = runif(10000, 1, 1000), y=runif(10000, 1, 1000))

b = bench::mark(
  d[d$x > 500, ],
  d[which(d$x > 500), ],
  subset(d, x > 500),
  filter(d, x > 500)
)
b
```

```
## # A tibble: 4 x 10
##   expression   min  mean median      max `itr/sec` mem_alloc  n_gc n_itr
##   <chr>      <bch> <bch> <bch:> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
## 1 d[d$x > 5… 150µs 319µs  276µs   2.66ms     3135.  283.22KB     6  1379
## 2 d[which(d… 106µs 153µs  135µs 549.65µs     6553.  176.61KB     9  2752
## 3 subset(d,… 227µs 310µs  269µs   1.38ms     3224.   379.9KB     9  1378
## 4 filter(d,… 263µs 445µs  342µs   6.36ms     2247.    4.32MB     5  1036
## # ... with 1 more variable: total_time <bch:tm>
```

---

## `bench` - relative results

```r
summary(b, relative=TRUE)
```

```
## # A tibble: 4 x 10
##   expressi…   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr total_time
##   <chr>     <dbl> <dbl>  <dbl> <dbl>     <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 d[d$x > …  1.42  2.09   2.05  4.85      1.40      1.60   1.2  1.33       1.05
## 2 d[which(…  1     1      1     1         2.92      1      1.8  2.66       1   
## 3 subset(d…  2.15  2.03   2.00  2.51      1.43      2.15   1.8  1.33       1.02
## 4 filter(d…  2.49  2.92   2.54 11.6       1        25.0    1    1          1.10
```

---
class: middle
count: false

# Parallelization

---

## `parallel`

Part of the base packages in R

* tools for the forking of R processes (some functions do not work on Windows)

* Core functions:
    
    * `detectCores`

* `pvec`

* `mclapply`

* `mcparallel` & `mccollect`

---

## `detectCores`

Surprisingly, detects the number of cores of the current system.

```r
detectCores()

## [1] 24
```

---

## pvec

Parallelization of a vectorized function call

```r
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

##   user  system elapsed 
##  0.214   0.029   0.243

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

##   user  system elapsed 
##  0.442   0.185   0.631

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

##   user  system elapsed 
##  0.532   0.389   0.372 
```

---

## pvec - `bench::system_time`

```r
bench::system_time(pvec(1:1e7, sqrt, mc.cores = 1))

##  process    real 
##    180ms   180ms

bench::system_time(pvec(1:1e7, sqrt, mc.cores = 4))

##  process    real 
##    935ms   980ms

bench::system_time(pvec(1:1e7, sqrt, mc.cores = 8))

##  process    real 
##    1.01s   1.05s 
```

```r
bench::system_time(Sys.sleep(.5))

##  process     real 
##   1.93ms 500.09ms 
```

---

```r
cores = c(1,4,8,16)
order = 6:8
res = map(
  cores, 
  function(x) {
     map_dbl(order, function(y) system.time)[3]) 
  }
) %>% do.call(rbind,.)

rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)

res

##            10^6  10^7  10^8
##  1 cores  0.016 0.282 1.489
##  2 cores  0.070 0.526 5.198
##  4 cores  0.052 0.430 5.023
##  8 cores  0.054 0.376 4.098
##  16 cores 0.073 0.401 4.049
```

---

## mclapply

Parallelized version of `lapply`

```r
system.time(rnorm(1e6))

##   user  system elapsed 
##  0.101   0.007   0.107

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

##   user  system elapsed 
##  0.148   0.136   0.106

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.242   0.061   0.052
```

---

```r
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

##   user  system elapsed 
##  0.097   0.047   0.079

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

##   user  system elapsed 
##  0.193   0.076   0.040

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

##   user  system elapsed 
##  0.162   0.083   0.041

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

##   user  system elapsed 
##  0.098   0.065   0.037 
```

---

## mcparallel

Asynchronously evaluation of an R expression in a separate process

```r
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)
```

```
## List of 2
##  $ pid: int 13601
##  $ fd : int [1:2] 4 7
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```

```r
str(n)
```

```
## List of 2
##  $ pid: int 13602
##  $ fd : int [1:2] 5 9
##  - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
```

---

## mccollect

Checks `mcparallel` objects for completion

```r
str(mccollect(list(m,n,o)))
```

```
## List of 3
##  $ 13601: num [1:1000000] -0.5659 -1.0518 1.3621 -0.6802 0.0894 ...
##  $ 13602: num [1:1000000] 0.738 0.293 0.173 0.678 0.481 ...
##  $ 13603: num [1:1000000] 0.9743 0.708 1.0087 2.8562 0.0716 ...
```

---

## mccollect - waiting

```r
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
```

```
## $`13604`
## [1] 0.001585076
```

```r
mccollect(p, wait = FALSE)     # will signal the job as terminating
```

```
## $`13604`
## NULL
```

```r
mccollect(p, wait = FALSE)     # there is no longer such a job
```

```
## NULL
```

---
class: middle
count: false

# doMC & foreach

---

## doMC & foreach

Packages by Revolution Analytics that provides the `foreach` function which is a parallelizable `for` loop (and then some).

* Core functions:
    
    * `registerDoMC`

* `foreach`, `%dopar%`, `%do%`

---

## `registerDoMC`

Primarily used to set the number of cores used by `foreach`, by default uses `options("cores")` or half the number of cores found by `detectCores` from the parallel package.

```r
options("cores")

## $cores
## NULL

detectCores()

## [1] 24

getDoParWorkers()

## [1] 1

registerDoMC(4)
getDoParWorkers()

## [1] 4
```

---

## `foreach`

A slightly more powerful version of base `for` loops (think `for` with an `lapply` flavor). Combined with `%do%` or `%dopar%` for single or multicore execution.

```r
for(i in 1:10) sqrt(i)

foreach(i = 1:5) %do% 
  sqrt(i)   
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
```

---

## `foreach` - iterators

`foreach` can iterate across more than one value, but it doesn't do length coercion

.pull-left[

```r
foreach(i = 1:5, j = 1:5) %do% 
  sqrt(i^2+j^2)   
```

```
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
## 
## [[3]]
## [1] 4.242641
## 
## [[4]]
## [1] 5.656854
## 
## [[5]]
## [1] 7.071068
```
]

.pull-right[

```r
foreach(i = 1:5, j = 1:2) %do% 
  sqrt(i^2+j^2)   
```

```
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2.828427
```
]

---

## `foreach` - combining results

```r
foreach(i = 1:5, .combine='c') %do% 
  sqrt(i)   
```

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
```

```r
foreach(i = 1:5, .combine='cbind') %do% 
  sqrt(i)   
```

```
##      result.1 result.2 result.3 result.4 result.5
## [1,]        1 1.414214 1.732051        2 2.236068
```

```r
foreach(i = 1:5, .combine='+') %do% 
  sqrt(i)   
```

```
## [1] 8.382332
```

---

## `foreach` - parallelization

Swapping out `%do%` for `%dopar%` will use the parallel backend.

```r
registerDoMC(4)
```

```r
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##    user  system elapsed 
##   1.036   0.100   0.721
```

```r
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##    user  system elapsed 
##   1.040   0.115   0.584
```

```r
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
```

```
##    user  system elapsed 
##   0.844   0.100   0.616
```

---

## Example - Bootstraping

Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and using that to conduct whatever analysis procedure is of interest. Below is an example of fitting a local regression (`loess`) to some synthetic data, we will construct a bootstrap prediction interval for this model.

. . .

```r
set.seed(3212016)
d = data.frame(x = 1:120) %>%
    mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))

l = loess(y ~ x, data=d)
d = d %>% mutate(
  pred_y = predict(l),
  pred_y_se = predict(l,se=TRUE)$se.fit
) %>% mutate(
  pred_low  = pred_y - 1.96 * pred_y_se,
  pred_high = pred_y + 1.96 * pred_y_se
)
```

---

```r
ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()
```

---

## Example - Cont.

We will now re-implement the code below using one of the parallelization techniques we have just discussed and will then check the performance with 1, 2, and 4 cores.

```r
n_rep = 5000
d_xy = select(d, x, y)

res = map(1:n_rep, function(i) {
  d_xy %>% 
    select(x,y) %>% 
    sample_n(nrow(d), replace=TRUE) %>%
    loess(y ~ x, data=.) %>%
    predict(newdata=d) %>%
    setNames(NULL)
}) %>% do.call(cbind, .)

d = d %>% mutate(
  bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE),
  bs_high  = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)
)
```

---

```r
ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
  geom_ribbon(aes(ymin=bs_low, ymax=bs_high), fill="blue", alpha=0.25) +
  geom_line(aes(y=pred_y)) +
  theme_bw()
```

---

## furrr / future

```r
system.time( purrr::map(c(2,2,2), Sys.sleep) )

##  user  system elapsed 
##  0.032   0.024   6.004

system.time( furrr::future_map(c(2,2,2), Sys.sleep) )

##   user  system elapsed 
##  0.110   0.028   6.066

future::plan(future::multiprocess)
system.time( furrr::future_map(c(2,2,2), Sys.sleep) )

##   user  system elapsed 
##  0.091   0.114   2.212
```

---

## What to use when?

Optimal use of multiple cores is hard, there isn't one best solution

* Don't underestimate the overhead cost

* More art than science - experimentation is key

* Measure it or it didn't happen

* Be aware of the trade off between developer time and run time

---
class: middle
count: false

# BLAS and LAPACK

---

## Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.

<br/>

For example:

* Linear regession models, find

$$ \hat{\beta} = (X^T X)^{-1} X^Ty $$

* Principle component analysis

* Find `$T = XW$` where `$W$` is a matrix whose columns are the eigenvectors of `$X^TX$`.
    
    * Often solved via SVD - Let `$X = U\Sigma W^T$` then `$T = U\Sigma$`.

---

## Numerical Linear Algebra

Not unique to Statistics, these are the type of problems that come up across all areas of numerical computing.

* Numerical linear algebra `$\ne$` mathematical linear algebra

<br/>

* Efficiency and stability of numerical algorithms matter

* Designing and implementing these algorithms is hard

<br/>

* Don't reinvent the wheel - common core linear algebra tools (well defined API)

---

## BLAS and LAPACK

Low level algorithms for common linear algebra operations

<br/>

BLAS

* **B**asic **L**inear **A**lgebra **S**ubprograms

* Copying, scaling, multiplying vectors and matrices

* Origins go back to 1979, written in Fortran

LAPACK

* **L**inear **A**lgebra **Pack**age

* Higher level functionality building on BLAS.

* Linear solvers, eigenvalues, and matrix decompositions

* Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

---

## Modern variants?

Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated

* Written in Fortran and designed for a single cpu core

* Certain (potentially non-optimal) hard coded defaults (e.g. block size).

Multithreaded alternatives:

* ATLAS - Automatically Tuned Linear Algebra Software

* OpenBLAS - fork of GotoBLAS from TACC at UTexas

* Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools

* cuBLAS / Magma - GPU libraries from NVidia and UTK respectively

---

## OpenBLAS  Matrix Multiply (DGEMM) Performance

|  n   | 1 core | 2 cores | 4 cores | 8 cores |
|------|--------|---------|---------|---------|
| 100  |  0.001 | 0.001   | 0.000   | 0.000   |
| 500  |  0.018 | 0.011   | 0.008   | 0.008   |
| 1000 |  0.128 | 0.068   | 0.041   | 0.036   |
| 2000 |  0.930 | 0.491   | 0.276   | 0.162   |
| 3000 |  3.112 | 1.604   | 0.897   | 0.489   |
| 4000 |  7.330 | 3.732   | 1.973   | 1.188   |
| 5000 | 14.223 | 7.341   | 3.856   | 2.310   |

---

---