class: center, middle, inverse, title-slide # Parallelization and Profiling ## Statistical Computing & Programming ### Shawn Santo ### 06-18-20 --- ## Supplementary materials Companion videos - [Parallelization using sockets](https://warpwire.duke.edu/w/kdkDAA/) - [Packages `doMC` and `foreach`](https://warpwire.duke.edu/w/ldkDAA/) - [Profiling code](https://warpwire.duke.edu/w/l9kDAA/) Additional resources - Getting Started with `doMC` and `foreach` [vignette](https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf) - `profvis` [guide](https://rstudio.github.io/profvis/) --- class: inverse, center, middle # Recall --- ## Benchmarking with package `bench` ```r library(bench) x <- runif(n = 1000000) bench::mark( sqrt(x), x ^ 0.5, x ^ (1 / 2), min_time = Inf, max_iterations = 1000 ) ``` ``` #> # A tibble: 3 x 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 sqrt(x) 1.93ms 2.27ms 431. 7.63MB 121. #> 2 x^0.5 17.27ms 18.7ms 53.3 7.63MB 8.89 #> 3 x^(1/2) 17.27ms 18.81ms 52.8 7.63MB 8.81 ``` -- <br/><br/> Functions `Sys.time()` and `bench::system_time()` are also available for you to time your code. --- ## Ways to parallelize 1. Sockets <br/><br/> A new version of R is launched on each core. - Available on all systems - Each process on each core is unique <br/><br/> 2. Forking <br/><br/> A copy of the current R session is moved to new cores. - **Not available on Windows** - Less overhead and easy to implement --- ## Package `parallel` ```r library(parallel) ``` Some core functions: - `detectCores()` - `pvec()`, parallelize a vector map function using forking - Argument `mc.cores` is used to set the number of cores - `mclapply()`, parallel version of `lapply()` using forking - Argument `mc.cores` is used to set the number of cores - Arguments `mc.preschedule` and `affinity.list` can be used for load balancing. 
- `mcparallel()`, `mccollect()`, evaluate an R expression asynchronously in a separate process

Our DSS R cluster has 16 cores available for use, while your laptop probably has 4 or 8.

---

## Load balancing example

.tiny[
```r
sleepR <- function(x) {
  Sys.sleep(x)
  runif(1)
}

x <- c(2.5, 2.5, 5)
aff_list_bal <- c(1, 1, 2)
aff_list_unbal <- c(1, 2, 2)
```
]

--

.tiny[
```r
# balanced load
system.time({
  mclapply(x, sleepR,
*          mc.preschedule = FALSE,
           affinity.list = aff_list_bal)
})
```

```
#>    user  system elapsed 
#>   0.011   0.023   5.024
```
]

--

.tiny[
```r
# unbalanced load
system.time({
  mclapply(x, sleepR,
*          mc.preschedule = FALSE,
           affinity.list = aff_list_unbal)
})
```

```
#>    user  system elapsed 
#>   0.005   0.009   7.516
```
]

---

class: inverse, center, middle

# Sockets

---

## Using sockets to parallelize

The basic recipe is as follows:

```r
detectCores()
c1 <- makeCluster(n_cores)
result <- clusterApply(cl = c1, ...)
stopCluster(c1)
```

<br/>

Here you are spawning new R sessions. Data, packages, functions, etc. need to be shipped to the workers.

---

## Sockets example

Function `clusterEvalQ()` evaluates a literal expression on each cluster node.

```r
clust <- makeCluster(4)
library(nycflights13)
clusterEvalQ(cl = clust, dim(flights))
stopCluster(clust)
```

--

```r
Error in checkForRemoteErrors(lapply(cl, recvResult)) : 
  4 nodes produced errors; first error: object 'flights' not found
```

<br/>

There is no inheritance. Package `nycflights13` is not loaded in the new R sessions spawned on each individual core.

---

```r
clust <- makeCluster(4)

clusterEvalQ(cl = clust, {
  library(nycflights13)
  dim(flights)
})
```

```
#> [[1]]
#> [1] 336776     19
#> 
#> [[2]]
#> [1] 336776     19
#> 
#> [[3]]
#> [1] 336776     19
#> 
#> [[4]]
#> [1] 336776     19
```

```r
stopCluster(clust)
```

<br/>

Function `clusterExport()` can be used to pass objects from the master process to the corresponding spawned sessions.
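Putting `clusterExport()` together with the recipe from earlier, here is a minimal sketch; the worker function `sq` and the object `offset` are hypothetical names used only for illustration.

```r
library(parallel)

offset <- 10                      # exists only in the master session
sq <- function(x) x ^ 2 + offset  # hypothetical worker function

cl <- makeCluster(2)
# functions passed to clusterApply() are shipped to the workers automatically,
# but the global objects they reference are not; export them explicitly
clusterExport(cl = cl, varlist = c("offset"))
res <- clusterApply(cl = cl, x = 1:4, fun = sq)
stopCluster(cl)

unlist(res)
#> [1] 11 14 19 26
```

Without the `clusterExport()` call, each worker fails with `object 'offset' not found`, just as in the `flights` example.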
--- ```r cl <- makeCluster(4) *library(nycflights13) *clusterExport(cl = cl, varlist = c("flights")) clusterEvalQ(cl = cl, {dim(flights)}) ``` ``` #> [[1]] #> [1] 336776 19 #> #> [[2]] #> [1] 336776 19 #> #> [[3]] #> [1] 336776 19 #> #> [[4]] #> [1] 336776 19 ``` ```r stopCluster(cl) ``` --- ## Apply operations using clusters There exists a family of analogous apply functions that use clusters. <br/> | Function | Description | |:----------------|:----------------------------------------| | `parApply()` | parallel version of `apply()` | | `parLapply()` | parallel version of `lapply()` | | `parLapplyLB()` | load balancing version of `parLapply()` | | `parSapply()` | parallel version of `sapply()` | | `parSapplyLB()` | load balancing version of `parSapply()` | <br/> The first argument is a cluster object. Subsequent arguments are similar to the corresponding base `apply()` variants. --- ## Bootstrapping Parallelize the bootstrap process using `dplyr` functions. ```r library(tidyverse) cl <- makeCluster(4) boot_samples <- clusterEvalQ(cl = cl, { library(dplyr) create_boot_sample <- function() { mtcars %>% select(mpg) %>% sample_n(size = nrow(mtcars), replace = TRUE) } replicate(2500, create_boot_sample()) } ) map(boot_samples, ~parLapply(cl, X = ., fun = mean)) %>% unlist() %>% as_tibble() %>% ggplot(aes(x = value)) + geom_histogram() + theme_minimal(base_size = 16) stopCluster(cl) ``` --- <img src="lec-22_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # `doMC` and `foreach` --- ## Parallelized `for` loop Package `doMC` is a parallel backend for the `foreach` package - a package that allows you to execute `for` loops in parallel. 
```r
library(doMC)
library(foreach)
```

Key functions:

- `doMC::registerDoMC()`, set the number of cores for the parallel backend to be used with `foreach`

- `foreach`, `%dopar%`, `%do%`, parallel loop

<br/><br/><br/>

<i>
`doMC` serves as an interface between `foreach` and `multicore`. Since `multicore` only works with systems that support forking, these functions will not work properly on Windows.
</i>

---

## Set workers

To get started, set the number of cores with `registerDoMC()`.

```r
# check cores set up
getDoParWorkers()
```

```
#> [1] 1
```

```r
# set 4 cores
registerDoMC(4)
getDoParWorkers()
```

```
#> [1] 4
```

---

## Serial and parallel with `foreach()`

.pull-left[

Sequential

.small[
```r
foreach(i = 1:4) %do%
  sort(runif(n = 1e7, max = i))[1]
```

```
#> [[1]]
#> [1] 2.91504e-07
#> 
#> [[2]]
#> [1] 9.732321e-08
#> 
#> [[3]]
#> [1] 4.163012e-07
#> 
#> [[4]]
#> [1] 2.421439e-07
```
]

.small[
```r
times(2) %do%
  sort(runif(n = 1e7))[1]
```

```
#> [1] 6.076880e-08 6.705523e-08
```
]

]

.pull-right[

Parallel

.small[
```r
foreach(i = 1:4) %dopar%
  sort(runif(n = 1e7, max = i))[1]
```

```
#> [[1]]
#> [1] 7.357448e-08
#> 
#> [[2]]
#> [1] 1.862645e-08
#> 
#> [[3]]
#> [1] 8.919742e-07
#> 
#> [[4]]
#> [1] 1.183711e-06
```
]

.small[
```r
times(2) %dopar%
  sort(runif(n = 1e7))[1]
```

```
#> [1] 7.217750e-09 3.157184e-07
```
]

]

---

## Time comparison

.pull-left[

Sequential

.small[
```r
system.time({
  foreach(i = 1:4) %do%
    sort(runif(n = 1e7, max = i))[1]
})
```

```
#>    user  system elapsed 
#>   3.466   0.180   3.670
```
]

.small[
```r
system.time({
  for (i in 1:4)
    sort(runif(n = 1e7, max = i))[1]
})
```

```
#>    user  system elapsed 
#>   3.954   0.136   4.123
```
]

]

.pull-right[

Parallel

.small[
```r
system.time({
  foreach(i = 1:4) %dopar%
    sort(runif(n = 1e7, max = i))[1]
})
```

```
#>    user  system elapsed 
#>   2.586   0.513   1.607
```
]

]

Even with four cores we don't see a fourfold improvement in time.

---

## Iterate over multiple indices

Add more indices separated by commas.
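For example, paired index vectors advance in lockstep, one element from each per iteration. A minimal sketch (the values are chosen only for illustration):

```r
library(foreach)

# i and j advance together; foreach() returns a list by default
res <- foreach(i = 1:3, j = c(10, 20, 30)) %do% {
  i + j
}

unlist(res)
#> [1] 11 22 33
```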
Argument `.combine` allows you to format the result into something other than the default list. ```r foreach(i = 1:3, j = 0:-2, .combine = "c") %dopar% {i ^ j} ``` ``` #> [1] 1.0000000 0.5000000 0.1111111 ``` -- ```r foreach(i = 1:3, j = 0:-4, .combine = "c") %dopar% {i ^ j} ``` ``` #> [1] 1.0000000 0.5000000 0.1111111 ``` ```r foreach(i = 1:4, j = 0:1, .combine = "c") %dopar% {i ^ j} ``` ``` #> [1] 1 2 ``` -- Length coercion is not supported. We'll need a nested structure. -- ```r foreach(i = 1:4, .combine = "c") %:% foreach(j = 0:1, .combine = "c") %dopar% {i ^ j} ``` ``` #> [1] 1 1 1 2 1 3 1 4 ``` --- ## Another example with `%:%` Consider a 50 x 50 banded matrix `V`. ```r p <- 50 V <- foreach(i = 1:p, .combine = "cbind") %:% foreach(j = 1:p, .combine = "c") %dopar% { if (abs(i - j) < 5) { rnorm(n = 1, sd = 1 + abs(i - j)) } else { 0 } } str(V) ``` ``` #> num [1:50, 1:50] -1.103 0.705 -2.672 -2.852 -3.744 ... #> - attr(*, "dimnames")=List of 2 #> ..$ : NULL #> ..$ : chr [1:50] "result.1" "result.2" "result.3" "result.4" ... ``` --- ## Exercise The 1986 crash of the space shuttle Challenger was linked to failure of O-ring seals in the rocket engines. Data was collected on the 23 previous shuttle missions. Perform leave-one-out cross validation in parallel fitting a logistic regression model where the response is `damage` / `no_damage`, predictor is `temp`, and data is from `orings` in package `faraway`. ```r library(tidyverse) library(faraway) data("orings") orings <- orings %>% mutate(damage = ifelse(damage > 0, 1, 0), no_damage = 1 - damage) ``` Compute the average test errors: `$$\mbox{average test error} = \frac{1}{n} \sum_{i = 1}^{n} 1_{(y_i \neq \hat{y}_i^{-i})}$$` ??? 
## Solution .solution[ ```r registerDoMC(4) foreach(i = 1:dim(orings)[1], .combine = "+") %dopar% { m <- glm(cbind(damage, no_damage) ~ temp, family = "binomial", data = orings[-i, , drop = FALSE]) y_hat <- round(predict(m, newdata = orings[i, , drop = FALSE], type = "response")) y <- orings[i, , drop = FALSE]$damage (abs(y - y_hat)) / dim(orings)[1] } ``` ``` #> 1 #> 0.1304348 ``` ] --- ## Exercise hint Perform leave-one-out cross validation in parallel fitting a logistic regression model where the response is `damage` / `no_damage`, predictor is `temp`, and data is from `orings` in package `faraway`. .tiny[ ```r library(tidyverse) library(faraway) data("orings") orings <- orings %>% mutate(damage = ifelse(damage > 0, 1, 0), no_damage = 1 - damage) ``` ] Compute the average test errors: `$$\mbox{average test error} = \frac{1}{n} \sum_{i = 1}^{n} 1_{(y_i \neq \hat{y}_i^{-i})}$$` Template code: .tiny[ ```r m <- glm(cbind(damage, no_damage) ~ temp, family = "binomial", data = orings[-i, , drop = FALSE]) y_hat <- round(predict(m, newdata = orings[i, , drop = FALSE], type = "response")) y <- orings[i, , drop = FALSE]$damage ``` ] --- ## More bootstrap Create a function that returns `\(\hat{\beta}_1\)`. ```r quiet_glm <- quietly(glm) get_b1 <- function() { orings_boot <- orings %>% sample_n(size = dim(orings)[1], replace = TRUE) m <- quiet_glm(cbind(damage, no_damage) ~ temp, family = "binomial", data = orings_boot) b1 <- coef(m$result)["temp"] if (length(m$warnings)) {b1 <- NULL} return(b1) } ``` Generate 10,000 bootstrap samples. 
```r N <- 10000 registerDoMC(4) b1_boot_sample <- times(N) %dopar% get_b1() ``` --- .tiny[ ```r tibble(x = b1_boot_sample) %>% ggplot(aes(x = x)) + geom_histogram(bins = 30, fill = "blue", color = "grey", alpha = .4) + labs(x = expression(hat(beta)[1])) + theme_bw(base_size = 16) ``` <img src="lec-22_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> ] -- .tiny[ ```r quantile(b1_boot_sample, c(.025, .975)) ``` ``` #> 2.5% 97.5% #> -0.56552863 -0.07648069 ``` ```r quantile(b1_boot_sample, c(.03, .98)) ``` ``` #> 3% 98% #> -0.54740157 -0.06193099 ``` ] --- ## Time check In parallel, 4 cores: ```r N <- 10000 registerDoMC(4) system.time({b1_boot_sample <- times(N) %dopar% get_b1()}) ``` ``` #> user system elapsed #> 23.044 3.117 7.466 ``` In parallel, 8 cores: ```r registerDoMC(8) system.time({b1_boot_sample <- times(N) %dopar% get_b1()}) ``` ``` #> user system elapsed #> 30.361 4.431 6.435 ``` Sequentially: ```r system.time({replicate(N, get_b1())}) ``` ``` #> user system elapsed #> 14.842 1.686 16.635 ``` --- class: inverse, center, middle # Profiling --- ## Profiling with `profvis` We can do more than just time our code. Package `profvis` provides an interactive interface to visualize profiling data. ```r library(profvis) ``` <br/><br/> To profile your code - wrap your R expression inside `profvis()`, - or use RStudio's GUI under the `Profile` tab. --- ## Exercise First, profile the below code. Then try to improve the computation time while keeping the loops and not using parallel computing. Lastly, try an apply variant and evaluate the performance. 
.tiny[
```r
reps <- 10000
n <- 1000
beta_0 <- 2
beta_1 <- .5
beta_2 <- 3

beta_1_hat_all <- c()

for (s in c(1, 3, 7)) {
  beta_1_hat <- c()
  for (i in 1:reps) {
    X <- cbind(rnorm(n), rnorm(n) ^ 2)
    Y <- beta_0 + beta_1 * X[, 1, drop = FALSE] +
         beta_2 * X[, 2, drop = FALSE] + rnorm(n, sd = s)
    m <- lm(Y ~ X)
    beta_1_hat <- c(beta_1_hat, coefficients(m)[2])
  }
  beta_1_hat_all <- c(beta_1_hat_all, beta_1_hat)
}

beta_df <- tibble(sd = rep(c(1, 3, 7), each = reps),
                  beta_1_hat = beta_1_hat_all)

beta_df %>%
  ggplot(aes(x = beta_1_hat_all, fill = factor(sd))) +
  geom_density(alpha = .4) +
  labs(x = expression(hat(beta)[1]), fill = expression(sigma)) +
  theme_bw(base_size = 16)
```
]

???

## Solution (start)

.solution[
```r
reps <- 10000
n <- 1000
beta_0 <- 2
beta_1 <- .5
beta_2 <- 3
st_dev <- c(1, 3, 7)

beta_1_hat_all <- NULL

for (s in st_dev) {
  beta_1_hat <- numeric(reps)
  for (i in 1:reps) {
    X <- cbind(1, rnorm(n), rnorm(n) ^ 2)
    Y <- beta_0 + beta_1 * X[, 2, drop = FALSE] +
         beta_2 * X[, 3, drop = FALSE] + rnorm(n, sd = s)
    beta_1_hat[i] <- (solve(t(X) %*% X) %*% t(X) %*% Y)[2, ]
  }
  beta_1_hat_all <- c(beta_1_hat_all, beta_1_hat)
}

beta_df <- tibble(sd = rep(st_dev, each = reps),
                  beta_1_hat = beta_1_hat_all)

beta_df %>%
  ggplot(aes(x = beta_1_hat_all, fill = factor(sd))) +
  geom_density(alpha = .4) +
  labs(x = expression(hat(beta)[1]), fill = expression(sigma)) +
  theme_bw(base_size = 16)
```
]

---

## Save profile

.tiny[
```r
library(profvis)

p <- profvis({
  reps <- 10000
  n <- 1000
  beta_0 <- 2; beta_1 <- .5; beta_2 <- 3

  beta_1_hat_all <- c()

  for (s in c(1, 3, 7)) {
    beta_1_hat <- c()
    for (i in 1:reps) {
      X <- cbind(rnorm(n), rnorm(n) ^ 2)
      Y <- beta_0 + beta_1 * X[, 1, drop = FALSE] +
           beta_2 * X[, 2, drop = FALSE] + rnorm(n, sd = s)
      m <- lm(Y ~ X)
      beta_1_hat <- c(beta_1_hat, coefficients(m)[2])
    }
    beta_1_hat_all <- c(beta_1_hat_all, beta_1_hat)
  }

  beta_df <- tibble(sd = rep(c(1, 3, 7), each = reps),
                    beta_1_hat = beta_1_hat_all)

  beta_df %>%
    ggplot(aes(x = beta_1_hat_all, fill = factor(sd))) +
    geom_density(alpha = .4) +
    labs(x = expression(hat(beta)[1]), fill = expression(sigma)) +
    theme_bw(base_size = 16)
})

htmlwidgets::saveWidget(p, "profile.html")
```
]

<a href="./profile.html">Profiled code</a>

---

## References

- Getting Started with `doMC` and `foreach` by Steve Weston
  https://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf
- https://rstudio.github.io/profvis/