BLAS and LAPACK



Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.


For example:

  • Linear regression models: find


β̂ = (XᵀX)⁻¹Xᵀy

  • Principal component analysis

    • Find T = XW where W is a matrix whose columns are the eigenvectors of XᵀX.

    • Often solved via the SVD: let X = UΣWᵀ, then T = UΣ.
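As a toy illustration of the regression formula above, here is a pure-Python least squares fit for the simple two-parameter case, forming XᵀX and Xᵀy by hand and solving the resulting 2×2 system. This is only a sketch of the algebra; in practice this computation is exactly what gets handed off to LAPACK.

```python
# Toy illustration of beta-hat = (X^T X)^{-1} X^T y for simple linear
# regression with X = [1, x]. Pure Python; not how you'd compute this
# in practice (stability!), but it shows the linear algebra involved.

def fit_ols(x, y):
    n = len(x)
    # Entries of the 2x2 matrix X^T X and the vector X^T y
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve the 2x2 normal equations explicitly (Cramer's rule)
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
print(fit_ols(x, y))       # -> (1.0, 2.0)
```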

Numerical Linear Algebra

These problems are not unique to statistics; they are the types of problems that come up across all areas of numerical computing.

  • Numerical linear algebra  ≠  mathematical linear algebra


  • Efficiency and stability of numerical algorithms matter

    • Designing and implementing these algorithms is hard


  • Don’t reinvent the wheel - use the common core of linear algebra tools (well defined API)
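A quick illustration of why numerical and mathematical linear algebra differ: floating point arithmetic breaks algebraic identities that hold on paper, so the order in which an algorithm performs operations changes the answer.

```python
# Floating point addition is not associative: (a + b) + c != a + (b + c).
# This is why the *order* of operations inside a numerical algorithm
# matters for accuracy, not just the mathematical formula it implements.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
```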

BLAS and LAPACK

Low level algorithms for common linear algebra operations


BLAS

  • Basic Linear Algebra Subprograms

  • Copying, scaling, multiplying vectors and matrices

  • Origins go back to 1979, written in Fortran


LAPACK

  • Linear Algebra Package

  • Higher level functionality building on BLAS.

  • Linear solvers, eigenvalues, and matrix decompositions

  • Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

R + BLAS/LAPACK

cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R 
        linux-vdso.so.1 =>  (0x00007fff922a3000)
        libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
        libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
        libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
        libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
        libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
        libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)

Modern variants?

Most available BLAS and LAPACK implementations (like R’s defaults) are somewhat dated

  • Designed for a single core CPU

  • Certain (potentially non-optimal) hard coded defaults (e.g. block size).


Multithreaded alternatives:

  • ATLAS - Automatically Tuned Linear Algebra Software

  • OpenBLAS - fork of GotoBLAS from TACC at UTexas

  • Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools

  • cuBLAS - NVIDIA’s GPU implementation of BLAS

  • MAGMA - hybrid CPU / GPU library from the ICL at UTK

Naming conventions

All BLAS and LAPACK subroutines are named using the form pmmaaa where:

  • p is a one letter code for the type of data

    • S single precision floating point
    • D double precision floating point
    • C complex single precision floating point
    • Z complex double precision floating point
  • mm is a two letter code for the type of matrix expected by the subroutine

  • aaa is a one to three letter code denoting the algorithm implemented by the subroutine
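The naming scheme can be sketched as a small lookup. The tables below are a toy decoder covering only the codes discussed here; the real lists of matrix-type and algorithm codes are much longer.

```python
# Toy decoder for the BLAS/LAPACK pmmaaa naming scheme.
# Lookup tables cover only the handful of codes used in these slides.
PREC = {"S": "single", "D": "double",
        "C": "complex single", "Z": "complex double"}
MAT  = {"GE": "general", "TR": "triangular",
        "PO": "symmetric positive definite"}
ALG  = {"MM": "matrix-matrix multiply", "MV": "matrix-vector multiply",
        "TRF": "triangular factorization"}

def decode(name):
    name = name.upper()
    p, mm, aaa = name[0], name[1:3], name[3:]
    return (PREC[p], MAT[mm], ALG[aaa])

print(decode("DGEMM"))   # ('double', 'general', 'matrix-matrix multiply')
print(decode("DPOTRF"))  # ('double', 'symmetric positive definite',
                         #  'triangular factorization')
```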

BLAS Examples

DGEMM

  • D - expects data of type double

  • GE - expects a general matrix (no useful structure)

  • MM - matrix / matrix multiplication operation


STRMV

  • S - expects data of type single precision

  • TR - expects a triangular matrix

  • MV - matrix / vector multiplication operation

DGEMM details

dgemm(   character   TRANSA,
         character   TRANSB,
         integer     M,
         integer     N,
         integer     K,
         double precision    ALPHA,
         double precision, dimension(lda,*)  A,
         integer     LDA,
         double precision, dimension(ldb,*)  B,
         integer     LDB,
         double precision    BETA,
         double precision, dimension(ldc,*)  C,
         integer     LDC 
     )   

DGEMM performs one of the matrix-matrix operations


C = α op(A) op(B) + βC

where op(X) is either op(X) = X or op(X) = Xᵀ, α and β are scalars, and A, B and C are matrices, with op(A) an m by k matrix, op(B) a k by n matrix and C an m by n matrix.
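The semantics of DGEMM can be mirrored in a few lines of pure Python. This only shows what the routine computes; the real DGEMM is blocked, vectorized Fortran tuned for cache behavior, which is the whole point of using it.

```python
# Pure-Python sketch of the DGEMM operation:
#   C <- alpha * op(A) @ op(B) + beta * C
# where op is the identity ("N") or transpose ("T").

def transpose(M):
    return [list(row) for row in zip(*M)]

def dgemm(transa, transb, alpha, A, B, beta, C):
    opA = transpose(A) if transa == "T" else A
    opB = transpose(B) if transb == "T" else B
    m, k, n = len(opA), len(opB), len(opB[0])
    return [[alpha * sum(opA[i][l] * opB[l][j] for l in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
# alpha = 1, beta = 0 reduces to a plain matrix product A @ B
print(dgemm("N", "N", 1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```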

STRMV details

strmv(   character   UPLO,
         character   TRANS,
         character   DIAG,
         integer     N,
         real, dimension(lda,*)  A,
         integer     LDA,
         real, dimension(*)  X,
         integer     INCX 
     )   

STRMV performs one of the matrix-vector operations


x = Ax        or         x = Aᵀx

where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
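The triangular matrix-vector product can likewise be sketched in pure Python. UPLO selects which triangle of A is referenced; DIAG = "U" means the diagonal is assumed to be all ones and is never read (a "unit" triangular matrix), which is why the routine can ignore whatever is stored there.

```python
# Pure-Python sketch of the TRMV operation x <- A x for triangular A.
# uplo: "U" (upper) or "L" (lower); diag: "U" (unit diagonal) or "N".

def trmv(uplo, diag, A, x):
    n = len(x)
    out = []
    for i in range(n):
        # Only one triangle of A is ever referenced
        cols = range(i, n) if uplo == "U" else range(0, i + 1)
        s = 0.0
        for j in cols:
            # A unit-triangular matrix has an implicit diagonal of ones
            aij = 1.0 if (diag == "U" and i == j) else A[i][j]
            s += aij * x[j]
        out.append(s)
    return out

A = [[2.0, 1.0], [9.9, 3.0]]          # 9.9 below the diagonal is never read
print(trmv("U", "N", A, [1.0, 1.0]))  # [3.0, 3.0]
```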

LAPACK Example

DPOTRF

  • D - expects data of type double

  • PO - positive definite matrix

  • TRF - triangular factorization

DPOTRF details

dpotrf(  character   UPLO,
         integer     N,
         double precision, dimension( lda, * )   A,
         integer     LDA,
         integer     INFO 
      )   

DPOTRF computes the Cholesky factorization of a real symmetric positive definite matrix A.

The factorization has the form
A = Uᵀ U, if UPLO = ’U’, or

A = L Lᵀ, if UPLO = ’L’,

where U is an upper triangular matrix and L is lower triangular.
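The factorization DPOTRF computes (for UPLO = 'L') is the textbook Cholesky loop, sketched here in pure Python. LAPACK performs it blocked and in place, overwriting the chosen triangle of A.

```python
import math

# Pure-Python sketch of the Cholesky factorization A = L L^T for a
# symmetric positive definite A, returning lower-triangular L.
def cholesky_lower(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                # Diagonal entry; sqrt of a negative here means A is not SPD
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 5.0]]
print(cholesky_lower(A))   # [[2.0, 0.0], [1.0, 2.0]]
```

Note that R's `chol()` (used in the timings below) returns the upper triangle, i.e. the UPLO = 'U' convention.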

Parallel vs Base BLAS

R using default BLAS (/usr/bin/Rscript) vs OpenBLAS (/usr/local/bin/Rscript)

$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed 
  1.258   0.010   1.268

$ /usr/local/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed 
  0.153   0.034   0.032 


$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
134.687   0.055 134.791 

$ /usr/local/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
 16.011   2.571   2.483 

Optimization beyond parallelism?

$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
  user  system elapsed 
 1.258   0.010   1.268

$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
  user  system elapsed 
 0.123   0.004   0.127 


$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
134.687   0.055 134.791 

$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
 14.197   0.057  14.250 

OpenBLAS DGEMM Performance

library(OpenBlasThreads)
x=matrix(runif(5000^2),ncol=5000)

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8)

sapply(cores, 
       function(c) 
       {
            openblas.set.num.threads(c)
            sapply(sizes, 
                   function(s) 
                   {
                        y = x[1:s,1:s]
                        system.time(y %*% y)[3]
                   })
       })

Elapsed time (seconds):

   n     1 core   2 cores   4 cores   8 cores
 100      0.001     0.001     0.000     0.000
 500      0.018     0.011     0.008     0.008
1000      0.128     0.068     0.041     0.036
2000      0.930     0.491     0.276     0.162
3000      3.112     1.604     0.897     0.489
4000      7.330     3.732     1.973     1.188
5000     14.223     7.341     3.856     2.310

OpenBLAS DPOTRF Performance

library(OpenBlasThreads)
library(clusterGeneration)
x=genPositiveDefMat(5000)$Sigma

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8)

sapply(cores, 
       function(c) 
       {
            openblas.set.num.threads(c)
            sapply(sizes, 
                   function(s) 
                   {
                        y = x[1:s,1:s]
                        system.time(chol(y))[3]
                   })
       })

Elapsed time (seconds):

   n     1 core   2 cores   4 cores   8 cores
 100      0.000     0.001     0.000     0.000
 500      0.006     0.004     0.003     0.002
1000      0.031     0.023     0.013     0.010
2000      0.211     0.135     0.074     0.049
3000      0.666     0.401     0.212     0.157
4000      1.530     0.903     0.460     0.283
5000      2.593     1.749     0.899     0.565


For comparison, the base BLAS takes ~26 secs for a 5000 x 5000 matrix.