BLAS and LAPACK



Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.


For example:

  • Linear regression models: find


β̂ = (XᵀX)⁻¹Xᵀy

  • Principal component analysis

    • Find T = XW where W is a matrix whose columns are the eigenvectors of XᵀX.

    • Often solved via the SVD: let X = UΣWᵀ, then T = UΣ.
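As a toy illustration of the regression formula above, here is a pure-Python least squares fit for the simple two-parameter case, forming XᵀX and Xᵀy by hand and solving the resulting 2×2 system. This is only a sketch of the algebra; in practice this computation is exactly what gets handed off to LAPACK.

```python
# Toy illustration of beta-hat = (X^T X)^{-1} X^T y for simple linear
# regression with X = [1, x]. Pure Python; not how you'd compute this
# in practice (stability!), but it shows the linear algebra involved.

def fit_ols(x, y):
    n = len(x)
    # Entries of the 2x2 matrix X^T X and the vector X^T y
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve the 2x2 normal equations explicitly (Cramer's rule)
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
print(fit_ols(x, y))       # -> (1.0, 2.0)
```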

Numerical Linear Algebra

These problems are not unique to statistics; they are the types of problems that come up across all areas of numerical computing.

  • Numerical linear algebra  ≠  mathematical linear algebra


  • Efficiency and stability of numerical algorithms matter

    • Designing and implementing these algorithms is hard


  • Don’t reinvent the wheel - use the common core of linear algebra tools (well defined API)
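A quick illustration of why numerical and mathematical linear algebra differ: floating point arithmetic breaks algebraic identities that hold on paper, so the order in which an algorithm performs operations changes the answer.

```python
# Floating point addition is not associative: (a + b) + c != a + (b + c).
# This is why the *order* of operations inside a numerical algorithm
# matters for accuracy, not just the mathematical formula it implements.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
```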

BLAS and LAPACK

Low level algorithms for common linear algebra operations


BLAS

  • Basic Linear Algebra Subprograms

  • Copying, scaling, multiplying vectors and matrices

  • Origins go back to 1979, written in Fortran


LAPACK

  • Linear Algebra Package

  • Higher level functionality building on BLAS.

  • Linear solvers, eigenvalues, and matrix decompositions

  • Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

R + BLAS/LAPACK

cr173@saxon [~]$ ldd /usr/lib64/R/bin/exec/R 
        linux-vdso.so.1 =>  (0x00007fff922a3000)
        libR.so => /usr/lib64/R/lib/libR.so (0x000000378d400000)
        libRblas.so => /usr/lib64/R/lib/libRblas.so (0x000000378da00000)
        libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x000000378c000000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003787c00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003787400000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x000000378cc00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003787800000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x000000378f400000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003788800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003788000000)
        libicuuc.so.42 => /usr/lib64/libicuuc.so.42 (0x0000003796a00000)
        libicui18n.so.42 => /usr/lib64/libicui18n.so.42 (0x0000003796600000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003789000000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003787000000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x000000378d000000)
        libicudata.so.42 => /usr/lib64/libicudata.so.42 (0x0000003795400000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003789400000)

Modern variants?

Most available BLAS and LAPACK implementations (like R’s defaults) are somewhat dated

  • Designed for a single core CPU

  • Certain (potentially non-optimal) hard coded defaults (e.g. block size).


Multithreaded alternatives:

  • ATLAS - Automatically Tuned Linear Algebra Software

  • OpenBLAS - fork of GotoBLAS from TACC at UTexas

  • Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools

  • cuBLAS - NVIDIA’s GPU implementation of BLAS

  • MAGMA - hybrid CPU / GPU library from the ICL at UTK

Naming conventions

All BLAS and LAPACK subroutines are named using the form pmmaaa where:

  • p is a one letter code for the type of data

    • S single precision floating point
    • D double precision floating point
    • C complex single precision floating point
    • Z complex double precision floating point
  • mm is a two letter code for the type of matrix expected by the subroutine

  • aaa is a one to three letter code denoting the algorithm implemented by the subroutine
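The naming scheme can be sketched as a small lookup. The tables below are a toy decoder covering only the codes discussed here; the real lists of matrix-type and algorithm codes are much longer.

```python
# Toy decoder for the BLAS/LAPACK pmmaaa naming scheme.
# Lookup tables cover only the handful of codes used in these slides.
PREC = {"S": "single", "D": "double",
        "C": "complex single", "Z": "complex double"}
MAT  = {"GE": "general", "TR": "triangular",
        "PO": "symmetric positive definite"}
ALG  = {"MM": "matrix-matrix multiply", "MV": "matrix-vector multiply",
        "TRF": "triangular factorization"}

def decode(name):
    name = name.upper()
    p, mm, aaa = name[0], name[1:3], name[3:]
    return (PREC[p], MAT[mm], ALG[aaa])

print(decode("DGEMM"))   # ('double', 'general', 'matrix-matrix multiply')
print(decode("DPOTRF"))  # ('double', 'symmetric positive definite',
                         #  'triangular factorization')
```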

BLAS Examples

DGEMM

  • D - expects data of type double

  • GE - expects a general matrix (no useful structure)

  • MM - matrix / matrix multiplication operation


STRMV

  • S - expects data of type single precision

  • TR - expects a triangular matrix

  • MV - matrix / vector multiplication operation

DGEMM details

dgemm(   character   TRANSA,
         character   TRANSB,
         integer     M,
         integer     N,
         integer     K,
         double precision    ALPHA,
         double precision, dimension(lda,*)  A,
         integer     LDA,
         double precision, dimension(ldb,*)  B,
         integer     LDB,
         double precision    BETA,
         double precision, dimension(ldc,*)  C,
         integer     LDC 
     )   

DGEMM performs one of the matrix-matrix operations


C = α op(A) op(B) + βC

where op(X) is either op(X) = X or op(X) = Xᵀ, α and β are scalars, and A, B and C are matrices, with op(A) an m by k matrix, op(B) a k by n matrix and C an m by n matrix.
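The semantics of DGEMM can be mirrored in a few lines of pure Python. This only shows what the routine computes; the real DGEMM is blocked, vectorized Fortran tuned for cache behavior, which is the whole point of using it.

```python
# Pure-Python sketch of the DGEMM operation:
#   C <- alpha * op(A) @ op(B) + beta * C
# where op is the identity ("N") or transpose ("T").

def transpose(M):
    return [list(row) for row in zip(*M)]

def dgemm(transa, transb, alpha, A, B, beta, C):
    opA = transpose(A) if transa == "T" else A
    opB = transpose(B) if transb == "T" else B
    m, k, n = len(opA), len(opB), len(opB[0])
    return [[alpha * sum(opA[i][l] * opB[l][j] for l in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
# alpha = 1, beta = 0 reduces to a plain matrix product A @ B
print(dgemm("N", "N", 1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```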

STRMV details

strmv(   character   UPLO,
         character   TRANS,
         character   DIAG,
         integer     N,
         real, dimension(lda,*)  A,
         integer     LDA,
         real, dimension(*)  X,
         integer     INCX 
     )   

STRMV performs one of the matrix-vector operations


x = Ax        or         x = Aᵀx

where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.
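The triangular matrix-vector product can likewise be sketched in pure Python. UPLO selects which triangle of A is referenced; DIAG = "U" means the diagonal is assumed to be all ones and is never read (a "unit" triangular matrix), which is why the routine can ignore whatever is stored there.

```python
# Pure-Python sketch of the TRMV operation x <- A x for triangular A.
# uplo: "U" (upper) or "L" (lower); diag: "U" (unit diagonal) or "N".

def trmv(uplo, diag, A, x):
    n = len(x)
    out = []
    for i in range(n):
        # Only one triangle of A is ever referenced
        cols = range(i, n) if uplo == "U" else range(0, i + 1)
        s = 0.0
        for j in cols:
            # A unit-triangular matrix has an implicit diagonal of ones
            aij = 1.0 if (diag == "U" and i == j) else A[i][j]
            s += aij * x[j]
        out.append(s)
    return out

A = [[2.0, 1.0], [9.9, 3.0]]          # 9.9 below the diagonal is never read
print(trmv("U", "N", A, [1.0, 1.0]))  # [3.0, 3.0]
```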

LAPACK Example

DPOTRF

  • D - expects data of type double

  • PO - positive definite matrix

  • TRF - triangular factorization

DPOTRF details

dpotrf(  character   UPLO,
         integer     N,
         double precision, dimension( lda, * )   A,
         integer     LDA,
         integer     INFO 
      )   

DPOTRF computes the Cholesky factorization of a real symmetric positive definite matrix A.

The factorization has the form
A = Uᵀ U, if UPLO = ’U’, or

A = L Lᵀ, if UPLO = ’L’,

where U is an upper triangular matrix and L is lower triangular.
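The factorization DPOTRF computes (for UPLO = 'L') is the textbook Cholesky loop, sketched here in pure Python. LAPACK performs it blocked and in place, overwriting the chosen triangle of A.

```python
import math

# Pure-Python sketch of the Cholesky factorization A = L L^T for a
# symmetric positive definite A, returning lower-triangular L.
def cholesky_lower(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                # Diagonal entry; sqrt of a negative here means A is not SPD
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 5.0]]
print(cholesky_lower(A))   # [[2.0, 0.0], [1.0, 2.0]]
```

Note that R's `chol()` (used in the timings below) returns the upper triangle, i.e. the UPLO = 'U' convention.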

Parallel vs Base BLAS

R using default BLAS (/usr/bin/Rscript) vs OpenBLAS (/usr/local/bin/Rscript)

$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed 
  1.258   0.010   1.268

$ /usr/local/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
   user  system elapsed 
  0.153   0.034   0.032 


$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
134.687   0.055 134.791 

$ /usr/local/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
 16.011   2.571   2.483 

Optimization beyond parallelism?

$ /usr/bin/Rscript -e "x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
  user  system elapsed 
 1.258   0.010   1.268

$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(1000^2),ncol=1000);system.time(x%*%x)"
  user  system elapsed 
 0.123   0.004   0.127 


$ /usr/bin/Rscript -e "x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
134.687   0.055 134.791 

$ /usr/local/bin/Rscript -e "library(OpenBlasThreads);openblas.set.num.threads(1);x=matrix(runif(5000^2),ncol=5000);system.time(x%*%x)"
   user  system elapsed 
 14.197   0.057  14.250 

OpenBLAS DGEMM Performance

library(OpenBlasThreads)
x=matrix(runif(5000^2),ncol=5000)

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8)

sapply(cores, 
       function(c) 
       {
            openblas.set.num.threads(c)
            sapply(sizes, 
                   function(s) 
                   {
                        y = x[1:s,1:s]
                        system.time(y %*% y)[3]
                   })
       })

Elapsed time (seconds):

   n     1 core   2 cores   4 cores   8 cores
 100      0.001     0.001     0.000     0.000
 500      0.018     0.011     0.008     0.008
1000      0.128     0.068     0.041     0.036
2000      0.930     0.491     0.276     0.162
3000      3.112     1.604     0.897     0.489
4000      7.330     3.732     1.973     1.188
5000     14.223     7.341     3.856     2.310

OpenBLAS DPOTRF Performance

library(OpenBlasThreads)
library(clusterGeneration)
x=genPositiveDefMat(5000)$Sigma

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8)

sapply(cores, 
       function(c) 
       {
            openblas.set.num.threads(c)
            sapply(sizes, 
                   function(s) 
                   {
                        y = x[1:s,1:s]
                        system.time(chol(y))[3]
                   })
       })

Elapsed time (seconds):

   n     1 core   2 cores   4 cores   8 cores
 100      0.000     0.001     0.000     0.000
 500      0.006     0.004     0.003     0.002
1000      0.031     0.023     0.013     0.010
2000      0.211     0.135     0.074     0.049
3000      0.666     0.401     0.212     0.157
4000      1.530     0.903     0.460     0.283
5000      2.593     1.749     0.899     0.565


For comparison, the base BLAS takes ~26 secs for a 5000 x 5000 matrix.