# BLAS and LAPACK

## Statistics and Linear Algebra

An awful lot of statistics is, at its core, linear algebra.

For example:

• Linear regression models: find

  β̂ = (XᵀX)⁻¹Xᵀy

• Principal component analysis

  • Find T = XW where W is a matrix whose columns are the eigenvectors of XᵀX.

  • Often solved via the SVD: let X = UΣWᵀ, then T = UΣ.
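Both identities are easy to check numerically. A minimal sketch with NumPy (the variable names and data are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=50)

# Least squares via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# PCA scores via the SVD: X = U Sigma W', so T = XW = U Sigma
U, S, Wt = np.linalg.svd(X, full_matrices=False)
T = X @ Wt.T        # T = XW (W = Wt.T holds eigenvectors of X'X)
T_svd = U * S       # equivalently T = U Sigma
```

Here `np.linalg.svd` returns Wᵀ (the transpose of the loadings matrix), so the two expressions for T agree column by column.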

## Numerical Linear Algebra

These problems are not unique to statistics; they come up across all areas of numerical computing.

• Numerical linear algebra  ≠  mathematical linear algebra

• Efficiency and stability of numerical algorithms matter

• Designing and implementing these algorithms is hard

• Don’t reinvent the wheel: use the common core linear algebra tools, which expose a well defined API
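A quick illustration of why numerical and mathematical linear algebra differ: forming XᵀX squares the condition number of X, so the normal equations can lose far more precision than a QR/SVD-based solver, even though the two are mathematically identical. A sketch with NumPy (my own made-up example):

```python
import numpy as np

# A nearly rank-deficient design matrix (Laeuchli-style):
# cond(X'X) ~ cond(X)^2, which is enormous here.
eps = 1e-6
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true

# Mathematically identical, numerically very different:
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
beta_qr = np.linalg.lstsq(X, y, rcond=None)[0]    # QR/SVD based

print(np.abs(beta_normal - beta_true).max())  # typically much larger
print(np.abs(beta_qr - beta_true).max())      # typically tiny
```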

## BLAS and LAPACK

Low level algorithms for common linear algebra operations

BLAS

• Basic Linear Algebra Subprograms

• Copying, scaling, multiplying vectors and matrices

• Origins go back to 1979, written in Fortran

LAPACK

• Linear Algebra Package

• Higher level functionality building on BLAS.

• Linear solvers, eigenvalues, and matrix decompositions

• Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

Checking which BLAS R is linked against with ldd:

```shell
cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R
        linux-vdso.so.1 =>  (0x00007fff922a3000)
        libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
        libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
        libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
        libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
        libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
        libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)
```

## Modern variants?

Most available BLAS and LAPACK implementations (like R’s defaults) are somewhat dated

• Designed for a single core CPU

• Certain (potentially non-optimal) hard coded defaults (e.g. block size)
Multithreaded alternatives:

• ATLAS - Automatically Tuned Linear Algebra Software

• OpenBLAS - fork of GotoBLAS from TACC at UTexas

• Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools

• cuBLAS / Magma - hybrid CPU / GPU library from the ICL at UTK

## Naming conventions

BLAS and LAPACK subroutines are named using the form pmmaaa where:

• p is a one letter code for the type of data

  • S - single precision floating point

  • D - double precision floating point

  • C - complex single precision floating point

  • Z - complex double precision floating point

• mm is a two letter code for the type of matrix expected by the subroutine

• aaa is a one to three letter code denoting the algorithm implemented by the subroutine

## BLAS Examples

DGEMM

• D - expects data of type double

• GE - expects a general matrix (no useful structure)

• MM - matrix / matrix multiplication operation

STRMV

• S - expects data of type single precision

• TR - expects a triangular matrix

• MV - matrix / vector multiplication operation

## DGEMM details

```
dgemm( character TRANSA,
       character TRANSB,
       integer M,
       integer N,
       integer K,
       double precision ALPHA,
       double precision, dimension(lda,*) A,
       integer LDA,
       double precision, dimension(ldb,*) B,
       integer LDB,
       double precision BETA,
       double precision, dimension(ldc,*) C,
       integer LDC )
```

DGEMM performs one of the matrix-matrix operations

C = α op(A) × op(B) + βC

where op(X) is either op(X) = X or op(X) = Xᵀ, α and β are scalars, and A, B and C are matrices, with op(A) an m by k matrix, op(B) a k by n matrix and C an m by n matrix.

## STRMV details

```
strmv( character UPLO,
       character TRANS,
       character DIAG,
       integer N,
       real, dimension(lda,*) A,
       integer LDA,
       real, dimension(*) X,
       integer INCX )
```

STRMV performs one of the matrix-vector operations x = Ax or x = Aᵀx, where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
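These routines can be called directly from high level languages. A sketch using SciPy's low-level BLAS wrappers (assuming SciPy is installed; the array sizes and values are made up — note SciPy exposes the double precision DTRMV rather than the single precision STRMV used in the example above):

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(3, 2))

# DGEMM: C = alpha * op(A) @ op(B) + beta * C
out = blas.dgemm(alpha=2.0, a=A, b=B, beta=0.5, c=C)
assert np.allclose(out, 2.0 * A @ B + 0.5 * C)

# DTRMV: x = A @ x for a triangular A
# (SciPy's wrapper reads the upper triangle by default)
T = np.triu(rng.normal(size=(4, 4)))
x = rng.normal(size=4)
y = blas.dtrmv(T, x)
assert np.allclose(y, T @ x)
```

R's `%*%` operator goes through the same DGEMM routine in whichever BLAS R is linked against.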
## LAPACK Example

DPOTRF

• D - expects data of type double

• PO - positive definite matrix

• TRF - triangular factorization

## DPOTRF details

```
dpotrf( character UPLO,
        integer N,
        double precision, dimension( lda, * ) A,
        integer LDA,
        integer INFO )
```

DPOTRF computes the Cholesky factorization of a real symmetric positive definite matrix A. The factorization has the form A = Uᵀ U, if UPLO = 'U', or A = L Lᵀ, if UPLO = 'L', where U is an upper triangular matrix and L is lower triangular.

## Parallel vs Base BLAS

R using default BLAS (/usr/bin/Rscript) vs OpenBLAS (/usr/local/bin/Rscript)

```shell
$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed
  1.258   0.010   1.268
```

```shell
$ /usr/local/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed
  0.153   0.034   0.032

$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed
134.687   0.055 134.791
```

```shell
$ /usr/local/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed
 16.011   2.571   2.483
```

## Optimization beyond parallelism?

```shell
$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed
  1.258   0.010   1.268
```

```shell
$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed
  0.123   0.004   0.127

$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed
134.687   0.055 134.791
```

```shell
$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed
 14.197   0.057  14.250
```

## OpenBLAS DGEMM Performance

```r
library(OpenBlasThreads)

x = matrix(runif(5000^2), ncol = 5000)

sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(y %*% y)[3]
  })
})
```

|    n | 1 core | 2 cores | 4 cores | 8 cores |
|-----:|-------:|--------:|--------:|--------:|
|  100 |  0.001 |   0.001 |   0.000 |   0.000 |
|  500 |  0.018 |   0.011 |   0.008 |   0.008 |
| 1000 |  0.128 |   0.068 |   0.041 |   0.036 |
| 2000 |  0.930 |   0.491 |   0.276 |   0.162 |
| 3000 |  3.112 |   1.604 |   0.897 |   0.489 |
| 4000 |  7.330 |   3.732 |   1.973 |   1.188 |
| 5000 | 14.223 |   7.341 |   3.856 |   2.310 |

## OpenBLAS DPOTRF Performance

```r
library(OpenBlasThreads)
library(clusterGeneration)

x = genPositiveDefMat(5000)$Sigma

sizes = c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores = c(1, 2, 4, 8)

sapply(cores, function(c) {
  openblas.set.num.threads(c)
  sapply(sizes, function(s) {
    y = x[1:s, 1:s]
    system.time(chol(y))[3]
  })
})
```

|    n | 1 core | 2 cores | 4 cores | 8 cores |
|-----:|-------:|--------:|--------:|--------:|
|  100 |  0.000 |   0.001 |   0.000 |   0.000 |
|  500 |  0.006 |   0.004 |   0.003 |   0.002 |
| 1000 |  0.031 |   0.023 |   0.013 |   0.010 |
| 2000 |  0.211 |   0.135 |   0.074 |   0.049 |
| 3000 |  0.666 |   0.401 |   0.212 |   0.157 |
| 4000 |  1.530 |   0.903 |   0.460 |   0.283 |
| 5000 |  2.593 |   1.749 |   0.899 |   0.565 |

Base BLAS takes ~26 secs for a 5000 x 5000 matrix.
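R's chol() reaches DPOTRF through LAPACK. The same routine is reachable from other front ends too; for illustration, a sketch via SciPy's low-level LAPACK wrapper (assuming SciPy is installed; the matrix is made up):

```python
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))
A = G @ G.T + 4 * np.eye(4)   # symmetric positive definite by construction

# DPOTRF with UPLO = 'U' (SciPy's default, lower=0): A = U' U
U, info = lapack.dpotrf(A)
assert info == 0              # nonzero info signals a non-positive-definite input
assert np.allclose(np.triu(U).T @ np.triu(U), A)
```

The info return value mirrors the INFO argument of the Fortran routine shown earlier.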