An awful lot of statistics is at its core linear algebra.
For example:
β̂ = (XᵀX)⁻¹Xᵀy
Principal component analysis
Find T = XW where W is a matrix whose columns are the eigenvectors of XᵀX.
Often solved via the SVD: let X = UΣWᵀ, then T = UΣ.
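As a rough sketch, both of these can be done in a couple of lines of base R (the X and y below are just simulated data for illustration):

set.seed(1)
X = matrix(rnorm(100 * 5), ncol = 5)
y = rnorm(100)

# Least squares: solve the normal equations (X'X) b = X'y
beta_hat = solve(crossprod(X), crossprod(X, y))

# PCA scores via the SVD: X = U Sigma W', so T = U Sigma = X W
s = svd(X)
T_scores = s$u %*% diag(s$d)    # equivalently X %*% s$v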
These problems are not unique to statistics; the same kinds of computations come up across all areas of numerical computing.
Efficiency and stability of numerical algorithms matter
Low level algorithms for common linear algebra operations:

BLAS (Basic Linear Algebra Subprograms)
- Copying, scaling, and multiplying vectors and matrices
- Origins go back to 1979; written in Fortran

LAPACK (Linear Algebra Package)
- Higher level functionality building on BLAS
- Linear solvers, eigenvalues, and matrix decompositions
- Origins go back to 1992; mostly Fortran (expanded on LINPACK, EISPACK)

We can see that R itself is dynamically linked against a BLAS shared library (libRblas.so):
cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R
linux-vdso.so.1 => (0x00007fff922a3000)
libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
/lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)
Most available BLAS and LAPACK implementations (like R's defaults) are somewhat dated:
- Designed for a single core CPU
- Certain (potentially non-optimal) hard-coded defaults (e.g. block size)

Multithreaded alternatives:
- ATLAS - Automatically Tuned Linear Algebra Software
- OpenBLAS - fork of GotoBLAS from TACC at UT Austin
- Intel MKL - Math Kernel Library, part of Intel's commercial compiler tools
- cuBLAS / Magma - GPU and hybrid CPU / GPU libraries (cuBLAS from NVIDIA, Magma from the ICL at UTK)
BLAS and LAPACK subroutines are named using the form pmmaaa where:

p is a one letter code for the type of data:
- S - single precision floating point
- D - double precision floating point
- C - complex single precision floating point
- Z - complex double precision floating point

mm is a two letter code for the type of matrix expected by the subroutine.

aaa is a one to three letter code denoting the algorithm implemented by the subroutine.
DGEMM
- D - expects data of type double
- GE - expects a general matrix (no useful structure)
- MM - matrix / matrix multiplication operation
DTRMV
- D - expects data of type double
- TR - expects a triangular matrix
- MV - matrix / vector multiplication operation
dgemm( character TRANSA,
character TRANSB,
integer M,
integer N,
integer K,
double precision ALPHA,
double precision, dimension(lda,*) A,
integer LDA,
double precision, dimension(ldb,*) B,
integer LDB,
double precision BETA,
double precision, dimension(ldc,*) C,
integer LDC
)
DGEMM
performs one of the matrix-matrix operations
C = α op(A) op(B) + β C
where op(X) is either op(X) = X or op(X) = Xᵀ, α and β are scalars, and A, B and C are matrices, with op(A) an m by k matrix, op(B) a k by n matrix, and C an m by n matrix.
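In R, a product of two double precision matrices with %*% is handled by exactly this kind of routine; the full DGEMM update written out in R looks like the sketch below (A, B, C, alpha, and beta are just example values):

set.seed(1)
m = 3; k = 4; n = 2
A = matrix(rnorm(m * k), m, k)
B = matrix(rnorm(k * n), k, n)
C = matrix(rnorm(m * n), m, n)
alpha = 2
beta  = 0.5

# C <- alpha * op(A) %*% op(B) + beta * C, with op() taken as the identity here
C_new = alpha * (A %*% B) + beta * C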
dtrmv( character UPLO,
character TRANS,
character DIAG,
integer N,
double precision, dimension(lda,*) A,
integer LDA,
double precision, dimension(*) X,
integer INCX
)
DTRMV
performs one of the matrix-vector operations
x = Ax or x = Aᵀx
where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
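Base R's %*% does not exploit triangular structure the way DTRMV does, but the operation itself is easy to write out; a small sketch (the matrix and vector below are made up for illustration):

set.seed(1)
n = 4
A = matrix(rnorm(n * n), n, n)
A[upper.tri(A)] = 0          # make A lower triangular
x = rnorm(n)

x_new  = A %*% x             # x <- A x
x_newT = t(A) %*% x          # x <- A^T x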
DPOTRF
- D - expects data of type double
- PO - expects a symmetric positive definite matrix
- TRF - triangular factorization
dpotrf( character UPLO,
integer N,
double precision, dimension( lda, * ) A,
integer LDA,
integer INFO
)
DPOTRF
computes the Cholesky factorization of a real symmetric positive definite matrix A.
The factorization has the form
A = Uᵀ U, if UPLO = 'U', or
A = L Lᵀ, if UPLO = 'L',
where U is an upper triangular matrix and L is lower triangular.
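R's chol() computes this factorization for a dense matrix and returns the upper triangular factor U; a minimal sketch (the 3 x 3 matrix below is just an example of a symmetric positive definite matrix):

A = matrix(c(4, 2, 2,
             2, 3, 1,
             2, 1, 2), nrow = 3)

U = chol(A)                   # upper triangular factor
all.equal(t(U) %*% U, A)      # TRUE: A = U^T U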
R using the default BLAS (/usr/bin/Rscript) vs. OpenBLAS (/usr/local/bin/Rscript):
$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
user system elapsed
1.258 0.010 1.268
$ /usr/local/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
user system elapsed
0.153 0.034 0.032
$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
user system elapsed
134.687 0.055 134.791
$ /usr/local/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
user system elapsed
16.011 2.571 2.483
Even restricted to a single thread, OpenBLAS is still substantially faster than the default BLAS:

$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
user system elapsed
1.258 0.010 1.268
$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
user system elapsed
0.123 0.004 0.127
$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
user system elapsed
134.687 0.055 134.791
$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
user system elapsed
14.197 0.057 14.250
library(OpenBlasThreads)

x = matrix(runif(5000^2), ncol = 5000)
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

# Elapsed time of a matrix multiply for each matrix size and thread count
sapply(cores, function(n_cores) {
  openblas.set.num.threads(n_cores)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(y %*% y)[3]
  })
})
Elapsed time (seconds) for y %*% y:

n | 1 core | 2 cores | 4 cores | 8 cores |
---|---|---|---|---|
100 | 0.001 | 0.001 | 0.000 | 0.000 |
500 | 0.018 | 0.011 | 0.008 | 0.008 |
1000 | 0.128 | 0.068 | 0.041 | 0.036 |
2000 | 0.930 | 0.491 | 0.276 | 0.162 |
3000 | 3.112 | 1.604 | 0.897 | 0.489 |
4000 | 7.330 | 3.732 | 1.973 | 1.188 |
5000 | 14.223 | 7.341 | 3.856 | 2.310 |
library(OpenBlasThreads)
library(clusterGeneration)

x = genPositiveDefMat(5000)$Sigma   # a random 5000 x 5000 positive definite matrix
sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

# Elapsed time of a Cholesky factorization for each matrix size and thread count
sapply(cores, function(n_cores) {
  openblas.set.num.threads(n_cores)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(chol(y))[3]
  })
})
Elapsed time (seconds) for chol(y):

n | 1 core | 2 cores | 4 cores | 8 cores |
---|---|---|---|---|
100 | 0.000 | 0.001 | 0.000 | 0.000 |
500 | 0.006 | 0.004 | 0.003 | 0.002 |
1000 | 0.031 | 0.023 | 0.013 | 0.010 |
2000 | 0.211 | 0.135 | 0.074 | 0.049 |
3000 | 0.666 | 0.401 | 0.212 | 0.157 |
4000 | 1.530 | 0.903 | 0.460 | 0.283 |
5000 | 2.593 | 1.749 | 0.899 | 0.565 |
For comparison, R's default BLAS takes ~26 secs for the Cholesky factorization of a 5000 x 5000 matrix.