Improved performance comes from iteration and from learning the most common pitfalls
Don’t sweat the small stuff - Coder time vs Run time vs Compute costs
Measure it, or it didn’t happen
“Premature optimization is the root of all evil (or at least most of it) in programming.” -Knuth
The simplest tool is R's base system.time, which can be used to wrap any other call or calls; it reports the user CPU time, system (OS) time, and elapsed wall clock time in seconds.
system.time({rnorm(1e6)})
## user system elapsed
## 0.116 0.004 0.121
system.time({rnorm(1e4) %*% t(rnorm(1e4))})
## user system elapsed
## 0.802 0.256 0.673
We can do better (higher precision, with automatic replication) using the microbenchmark package
install.packages("microbenchmark")
library(microbenchmark)
d = abs(rnorm(1000))
r = microbenchmark(
exp(log(d)/2),
d^0.5,
sqrt(d),
times = 1000
)
print(r)
## Unit: microseconds
## expr min lq mean median uq max neval
## exp(log(d)/2) 18.885 22.2520 27.099327 22.9090 25.0270 204.077 1000
## d^0.5 29.842 32.9425 40.330387 34.0590 36.0955 624.320 1000
## sqrt(d) 2.935 5.8760 7.983377 6.2765 7.1700 315.885 1000
boxplot(r)
We can also do better using the rbenchmark package
install.packages("rbenchmark")
library(rbenchmark)
d = abs(rnorm(1000))
benchmark(
exp(log(d)/2),
d^0.5,
sqrt(d),
replications = 1000,
order = "relative"
)
## test replications elapsed relative user.self sys.self user.child sys.child
## 3 sqrt(d) 1000 0.008 1.000 0.007 0.001 0 0
## 1 exp(log(d)/2) 1000 0.025 3.125 0.023 0.001 0 0
## 2 d^0.5 1000 0.040 5.000 0.037 0.002 0 0
Earlier we mentioned that growing a vector as you collect results is bad, but just how bad is it? Benchmark the following three functions and compare their performance (one possible benchmarking call is sketched after the definitions).
good = function() {
res = rep(NA, 1e4)
for(i in seq_along(res))
res[i] = sqrt(i)
}
bad = function() {
res = numeric()
for(i in 1:1e4)
res = c(res,sqrt(i))
}
best = function() {
sqrt(1:1e4)
}
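For reference, one way to run this comparison (a sketch using the microbenchmark package from above; exact timings depend on your machine, but bad() should be orders of magnitude slower than the other two):
library(microbenchmark)
microbenchmark(
  good(),
  bad(),
  best(),
  times = 10 # bad() is slow enough that a few replications are plenty
)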
Let's compare looping vs. the map (apply) functions vs. dplyr.
set.seed(523)
d = data_frame(
A = rnorm(1e5),
B = runif(1e5),
C = rexp(1e5),
D = sample(letters, 1e5, replace=TRUE)
)
Implement functions that will return a data frame containing the maximum value of each column of d using:
a map function
a for loop
dplyr
Benchmark all of your preceding functions using data frame d. Which is the fastest? Why do you think this is the case? 10 replicates per function is sufficient.
What would happen if d was smaller? (e.g. take only the first 1000 rows)
A sketch of possible implementations and a benchmark follows.
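One possible set of implementations (a sketch only; the function names are my own, and map_dfc / summarise_all are just one reasonable choice from purrr / dplyr):
library(purrr)
library(dplyr)
library(microbenchmark)

# map version: apply max() to each column and column-bind the results
col_max_map = function(df) map_dfc(df, max)

# for loop version: fill in a 1-row data frame column by column
col_max_loop = function(df) {
  res = df[1, ] # placeholder row keeps the column names and types
  for (i in seq_along(df))
    res[[i]] = max(df[[i]])
  res
}

# dplyr version
col_max_dplyr = function(df) summarise_all(df, max)

microbenchmark(
  col_max_map(d),
  col_max_loop(d),
  col_max_dplyr(d),
  times = 10
)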
parallel
Part of the base packages in R
tools for the forking of R processes (some functions do not work on Windows)
Core functions:
detectCores
pvec
mclapply
mcparallel & mccollect
Surprisingly, detectCores detects the number of cores of the current system.
detectCores()
## [1] 24
pvec - parallelization of a vectorized function call (the vector is split into chunks that are processed on separate cores)
system.time(pvec(1:1e7, sqrt, mc.cores = 1))
## user system elapsed
## 0.214 0.029 0.243
system.time(pvec(1:1e7, sqrt, mc.cores = 4))
## user system elapsed
## 0.442 0.185 0.631
system.time(pvec(1:1e7, sqrt, mc.cores = 8))
## user system elapsed
## 0.532 0.389 0.372
cores = c(1,2,4,8,16)
order = 6:8
res = map(
cores,
function(x) {
map_dbl(order, function(y) system.time(pvec(1:(10^y), sqrt, mc.cores=x))[3])
}
) %>% do.call(rbind,.)
rownames(res) = paste0(cores," cores")
colnames(res) = paste0("10^",order)
res
## 10^6 10^7 10^8
## 1 cores 0.016 0.282 1.489
## 2 cores 0.070 0.526 5.198
## 4 cores 0.052 0.430 5.023
## 8 cores 0.054 0.376 4.098
## 16 cores 0.073 0.401 4.049
Note that adding cores actually slows things down here: sqrt is so cheap that the overhead of forking processes, splitting the vector, and recombining the results dominates the total run time.
mclapply - a parallelized version of lapply
system.time(rnorm(1e6))
## user system elapsed
## 0.101 0.007 0.107
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))
## user system elapsed
## 0.148 0.136 0.106
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.242 0.061 0.052
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.097 0.047 0.079
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))
## user system elapsed
## 0.193 0.076 0.040
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))
## user system elapsed
## 0.162 0.083 0.041
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))
## user system elapsed
## 0.098 0.065 0.037
mcparallel - asynchronous evaluation of an R expression in a separate process
m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))
str(m)
## List of 2
## $ pid: int 17004
## $ fd : int [1:2] 4 7
## - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
str(n)
## List of 2
## $ pid: int 17005
## $ fd : int [1:2] 5 9
## - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
mccollect - checks mcparallel objects for completion and collects their results
str(mccollect(list(m,n,o)))
## List of 3
## $ 17004: num [1:1000000] 0.378 1.475 -0.66 0.695 2.37 ...
## $ 17005: num [1:1000000] 0.0108 0.6641 0.2954 0.5953 0.1836 ...
## $ 17006: num [1:1000000] 0.319 0.103 0.174 3.198 0.861 ...
p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
## $`17007`
## [1] -0.00102628
mccollect(p, wait = FALSE) # will signal the job as terminating
## $`17007`
## NULL
mccollect(p, wait = FALSE) # there is no longer such a job
## NULL
doMC & foreach
Packages by Revolution Analytics that provide the foreach function, which is a parallelizable for loop (and then some).
Core functions:
registerDoMC
foreach, %dopar%, %do%
registerDoMC is primarily used to set the number of cores used by foreach; by default it uses options("cores") or half the number of cores found by detectCores from the parallel package.
options("cores")
## $cores
## NULL
detectCores()
## [1] 24
library(doMC) # also attaches the foreach package
getDoParWorkers()
## [1] 1
registerDoMC(4)
getDoParWorkers()
## [1] 4
A slightly more powerful version of base for loops (think for with an lapply flavor). Combined with %do% or %dopar% for single core or multicore execution.
for(i in 1:10) sqrt(i)
foreach(i = 1:5) %do% sqrt(i)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
foreach can iterate across more than one value, but it doesn't do length coercion
foreach(i = 1:5, j = 1:5) %do% sqrt(i^2+j^2)
## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
##
## [[3]]
## [1] 4.242641
##
## [[4]]
## [1] 5.656854
##
## [[5]]
## [1] 7.071068
foreach(i = 1:5, j = 1:2) %do% sqrt(i^2+j^2)
## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2.828427
foreach(i = 1:5, .combine='c') %do% sqrt(i)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
foreach(i = 1:5, .combine='cbind') %do% sqrt(i)
## result.1 result.2 result.3 result.4 result.5
## [1,] 1 1.414214 1.732051 2 2.236068
foreach(i = 1:5, .combine='+') %do% sqrt(i)
## [1] 8.382332
Swapping out %do% for %dopar% will use the parallel backend.
registerDoMC(4)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
## user system elapsed
## 0.428 0.058 0.604
registerDoMC(8)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
## user system elapsed
## 1.516 0.182 0.646
registerDoMC(12)
system.time(foreach(i = 1:10) %dopar% mean(rnorm(1e6)))
## user system elapsed
## 0.937 0.123 0.360
Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking a sample (with replacement) of the same size as the original data, and using that to conduct whatever analysis procedure is of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.
set.seed(3212016)
d = data.frame(x = 1:120) %>%
mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))
l = loess(y ~ x, data=d)
d = d %>% mutate(
pred_y = predict(l),
pred_y_se = predict(l,se=TRUE)$se.fit
) %>% mutate(
pred_low = pred_y - 1.96 * pred_y_se,
pred_high = pred_y + 1.96 * pred_y_se
)
ggplot(d, aes(x,y)) +
geom_point(color="darkgrey") +
geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
geom_line(aes(y=pred_y)) +
theme_bw()
We will now re-implement the code below using one of the parallelization techniques we have just discussed and will then check the performance with 1, 2, and 4 cores (one possible parallel version is sketched after the serial code and plot).
n_rep = 5000
d_xy = select(d, x, y)
res = map(1:n_rep, function(i) {
d_xy %>%
select(x,y) %>%
sample_n(nrow(d), replace=TRUE) %>%
loess(y ~ x, data=.) %>%
predict(newdata=d) %>%
setNames(NULL)
}) %>% do.call(cbind, .)
d = d %>% mutate(
bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE),
bs_high = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)
)
ggplot(d, aes(x,y)) +
geom_point(color="gray50") +
geom_ribbon(aes(ymin=pred_low, ymax=pred_high), fill="red", alpha=0.25) +
geom_ribbon(aes(ymin=bs_low, ymax=bs_high), fill="blue", alpha=0.25) +
geom_line(aes(y=pred_y)) +
theme_bw()
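One possible parallel re-implementation (a sketch using mclapply from the parallel package; the helper name bootstrap_loess is my own, and actual timings for 1, 2, and 4 cores will depend on your hardware):
library(parallel)
library(dplyr)

bootstrap_loess = function(n_rep, cores) {
  mclapply(1:n_rep, function(i) {
    d_xy %>%
      sample_n(nrow(d), replace = TRUE) %>% # resample rows with replacement
      loess(y ~ x, data = .) %>%
      predict(newdata = d) %>%
      setNames(NULL)
  }, mc.cores = cores) %>%
    do.call(cbind, .)
}

system.time(bootstrap_loess(5000, cores = 1))
system.time(bootstrap_loess(5000, cores = 2))
system.time(bootstrap_loess(5000, cores = 4))
Since each loess fit is relatively expensive, the forking overhead should be small compared to the actual work, so the speedup going from 1 to 4 cores should be much closer to linear than in the pvec(sqrt) example earlier.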
Optimal use of multiple cores is hard; there isn't one best solution
Don't underestimate the overhead cost
More art than science - experimentation is key
Measure it or it didn't happen
Be aware of the trade-off between developer time and run time
An awful lot of statistics is, at its core, linear algebra.
For example, the least squares solution:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
Principal component analysis:
Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
Often solved via the SVD: let \(X = U\Sigma W^T\), then \(T = U\Sigma\).
Not unique to statistics, these are the types of problems that come up across all areas of numerical computing.
Efficiency and stability of numerical algorithms matter
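A quick numerical sanity check of both identities (a minimal sketch on simulated data; lm and prcomp are used only to verify the direct linear algebra, and X is centered so the PCA comparison lines up):
set.seed(1)
X = matrix(rnorm(100 * 3), ncol = 3)
y = drop(X %*% c(1, -2, 3) + rnorm(100))

# Least squares via the normal equations, compared against lm()
beta_hat = drop(solve(t(X) %*% X) %*% t(X) %*% y) # (X^T X)^{-1} X^T y
cbind(normal_eq = beta_hat, lm = coef(lm(y ~ X - 1)))

# PCA scores via the SVD: if X = U Sigma W^T then T = X W = U Sigma
Xc = scale(X, center = TRUE, scale = FALSE)
s = svd(Xc)
T_svd = s$u %*% diag(s$d)
T_pca = prcomp(X, center = TRUE, scale. = FALSE)$x
max(abs(abs(T_svd) - abs(T_pca))) # ~0, columns agree up to sign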
Low level algorithms for common linear algebra operations
BLAS
Basic Linear Algebra Subprograms
Copying, scaling, multiplying vectors and matrices
Origins go back to 1979, written in Fortran
LAPACK
Linear Algebra Package
Higher level functionality building on BLAS.
Linear solvers, eigenvalues, and matrix decompositions
Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
Most default BLAS and LAPACK implementations (like R's defaults) are somewhat dated:
Written in Fortran and designed for a single CPU core
Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
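To check which BLAS/LAPACK your R installation is actually using (a sketch; the sessionInfo() BLAS and LAPACK fields assume a reasonably recent R release, roughly 3.4.0 or later):
# LAPACK version R is linked against
La_version()

# Recent versions of R also report the BLAS and LAPACK shared libraries in use
si = sessionInfo()
si$BLAS
si$LAPACK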
Multithreaded alternatives:
ATLAS - Automatically Tuned Linear Algebra Software
OpenBLAS - fork of GotoBLAS from TACC at UTexas
Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools
cuBLAS / Magma - GPU libraries from NVidia and UTK respectively
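Timings like those in the table below (presumably elapsed seconds for an n x n matrix multiply) can be generated with a simple benchmark; the helper name mat_mult_time is my own, and which library and how many threads are used depends on the BLAS your R is linked against and on settings such as the OPENBLAS_NUM_THREADS environment variable:
# Time an n x n matrix multiply and return the elapsed seconds
mat_mult_time = function(n) {
  A = matrix(rnorm(n * n), ncol = n)
  B = matrix(rnorm(n * n), ncol = n)
  system.time(A %*% B)["elapsed"]
}
sapply(c(100, 500, 1000, 2000), mat_mult_time)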
n | 1 core | 2 cores | 4 cores | 8 cores |
---|---|---|---|---|
100 | 0.001 | 0.001 | 0.000 | 0.000 |
500 | 0.018 | 0.011 | 0.008 | 0.008 |
1000 | 0.128 | 0.068 | 0.041 | 0.036 |
2000 | 0.930 | 0.491 | 0.276 | 0.162 |
3000 | 3.112 | 1.604 | 0.897 | 0.489 |
4000 | 7.330 | 3.732 | 1.973 | 1.188 |
5000 | 14.223 | 7.341 | 3.856 | 2.310 |
M | N | K | MAGMA (ms) | cuBLAS (ms) | CPU (ms) |
---|---|---|---|---|---|
1088 | 1088 | 1088 | 4.68 | 5.98 | 69.54 |
2112 | 2112 | 2112 | 29.79 | 17.33 | 329.42 |
3136 | 3136 | 3136 | 98.68 | 54.17 | 958.82 |
4160 | 4160 | 4160 | 230.35 | 125.54 | 2119.71 |
5184 | 5184 | 5184 | 448.66 | 240.91 | 3998.74 |
6208 | 6208 | 6208 | 772.88 | 412.61 | 6801.51 |
7232 | 7232 | 7232 | 1224.33 | 650.76 | 10629.33 |
8256 | 8256 | 8256 | 1823.29 | 967.20 | 15740.96 |
9280 | 9280 | 9280 | 2590.17 | 1370.57 | 22367.01 |
10304 | 10304 | 10304 | 3517.43 | 1872.64 | 30597.85 |
M | N | K | MAGMA (ms) | cuBLAS (ms) | CPU (ms) |
---|---|---|---|---|---|
1088 | 1088 | 1088 | 0.94 | 0.72 | 64.82 |
2112 | 2112 | 2112 | 5.60 | 4.71 | 154.75 |
3136 | 3136 | 3136 | 17.71 | 13.82 | 357.92 |
4160 | 4160 | 4160 | 41.45 | 32.43 | 741.42 |
5184 | 5184 | 5184 | 80.21 | 62.28 | 1293.62 |
6208 | 6208 | 6208 | 138.34 | 106.78 | 2223.03 |
7232 | 7232 | 7232 | 218.60 | 167.36 | 3413.69 |
8256 | 8256 | 8256 | 326.42 | 249.71 | 4892.46 |
9280 | 9280 | 9280 | 465.48 | 353.87 | 6999.95 |
10304 | 10304 | 10304 | 638.43 | 483.86 | 9547.43 |