---
title: "Lecture 21"
subtitle: "Computational Methods for GPs"
author: "Colin Rundel"
date: "04/10/2017"
fontsize: 11pt
output:
  beamer_presentation:
    theme: metropolis
    highlight: pygments
    fig_caption: false
    latex_engine: xelatex
    keep_tex: true
    includes:
      in_header: ../settings.tex
---

```{r setup, include=FALSE}
library(raster)
library(dplyr)
library(ggplot2)
library(patchwork)
library(sf)
library(rstan)

set.seed(20180405)

knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width=7, fig.height=4.5, out.width="\\textwidth",
  fig.align="center", echo=TRUE, warning=FALSE
)

ggplot2::theme_set(ggplot2::theme_bw())

rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

source("../util.R")
```

# GPs and Computational Complexity

## The problem with GPs

Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process $\symbf{y} \sim \mathcal{N}(\symbf{\mu},\symbf{\Sigma})$:

. . .

\vspace{3mm}

Want to sample $\symbf{y}$?
\[ \symbf{\mu} + \hlr{\text{Chol}(\symbf{\Sigma})} \times \symbf{Z} \text{ with } Z_i \sim \mathcal{N}(0,1) \qquad \qquad \color{redhl}{\mathcal{O}\left(n^3\right)} \]

. . .

Evaluate the (log) likelihood?
\[ -\frac{1}{2} \log \hlr{|\Sigma|} - \frac{1}{2} (\symbf{y}-\symbf{\mu})' \hlr{\symbf{\Sigma}^{-1}} (\symbf{y}-\symbf{\mu}) - \frac{n}{2}\log 2\pi \qquad \qquad \color{redhl}{\mathcal{O}\left(n^3\right)}\]

. . .

Update a covariance parameter?
\[ \hly{\{\Sigma\}_{ij}} = \sigma^2 \exp(-\{d\}_{ij}\phi) + \sigma^2_n \, 1_{i=j} \qquad \qquad \color{yellowhl}{\mathcal{O}\left(n^2\right)}\]

## A simple guide to computational complexity

\Large

\begin{center}
\vfill
$\mathcal{O}\left(n\right)$ - Linear complexity \pause- Go for it

\pause
\vspace{15mm}

$\mathcal{O}\left(n^2\right)$ - Quadratic complexity \pause- Pray

\pause
\vspace{15mm}

$\mathcal{O}\left(n^3\right)$ - Cubic complexity \pause- Give up
\vfill
\end{center}

## How bad is the problem?

```{r echo=FALSE, message=FALSE}
decomp = readr::read_csv("data/lapack.csv")

ggplot(decomp, aes(y=cpu, x=n, color=method)) +
  geom_line() +
  geom_point() +
  ylab("time (secs)")
```

## Practice - Migratory Model Prediction

After fitting the GP we need to sample from the posterior predictive distribution at $\sim3000$ locations
$$ \symbf{y}_{p} \sim \mathcal{N}\left(\mu_p + \Sigma_{po} \Sigma_o^{-1}(y_o - \mu_o) ,~ \Sigma_p - \Sigma_{po} \Sigma_{o}^{-1} \Sigma_{op}\right) $$

. . .

\scriptsize

| Step                                                                       | CPU (secs) |
|----------------------------------------------------------------------------|------------|
| 1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_{o}$                           | 1.080      |
| 2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_{o}^{-1} \Sigma_{op})$ | 0.467      |
| 3. Calc. $\mu_{p|o} + \text{chol}(\Sigma_{p|o}) \times Z$                  | 0.049      |
| 4. Calc. Allele Prob                                                       | 0.129      |
| Total                                                                      | 1.732      |

\normalsize

Total run time for 1000 posterior predictive draws:

* CPU (28.9 min)
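## Aside - Posterior Predictive Sampling in R

The table above breaks the cost down into covariance construction, a Cholesky factorization, and a matrix-vector draw. Below is a minimal, self-contained sketch of steps 1--3; the covariance function, locations, and object names are invented for illustration and are not the migratory model code.

```{r pred-sketch, eval=FALSE}
set.seed(1)

# Hypothetical observed and prediction locations (small n so this runs quickly)
n_o = 200; n_p = 50
loc_o = matrix(runif(2 * n_o), ncol = 2)
loc_p = matrix(runif(2 * n_p), ncol = 2)
y_o   = rnorm(n_o)                        # stand-in for the observed data
mu_o  = rep(0, n_o); mu_p = rep(0, n_p)   # assume a known / estimated mean

cov_fn = function(d) exp(-3 * d)          # an exponential covariance, as an example

# Step 1 - construct the needed covariance blocks (O(n^2) entries)
d_all    = as.matrix(dist(rbind(loc_o, loc_p)))
Sigma_o  = cov_fn(d_all[1:n_o, 1:n_o]) + diag(1e-6, n_o)       # jitter for stability
Sigma_p  = cov_fn(d_all[-(1:n_o), -(1:n_o)]) + diag(1e-6, n_p)
Sigma_po = cov_fn(d_all[-(1:n_o), 1:n_o])

# Step 2 - conditional mean / covariance and its Cholesky (O(n^3))
Sigma_o_inv = chol2inv(chol(Sigma_o))
mu_cond     = mu_p + Sigma_po %*% Sigma_o_inv %*% (y_o - mu_o)
Sigma_cond  = Sigma_p - Sigma_po %*% Sigma_o_inv %*% t(Sigma_po)
L_cond      = t(chol(Sigma_cond))

# Step 3 - one posterior predictive draw
y_pred = mu_cond + L_cond %*% rnorm(n_p)
```

Because each posterior draw comes with its own covariance parameter values, steps 1--3 are repeated for every draw, which is how $\sim$1.7 seconds per draw becomes $\sim$29 minutes for 1000 draws.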
## A bigger hammer?

\scriptsize

| Step                                                                       | CPU (secs) | CPU+GPU (secs) | Rel. Perf |
|----------------------------------------------------------------------------|------------|----------------|-----------|
| 1. Calc. $\Sigma_p$, $\Sigma_{po}$, $\Sigma_{o}$                           | 1.080      | 0.046          | 23.0      |
| 2. Calc. $\text{chol}(\Sigma_p - \Sigma_{po} \Sigma_{o}^{-1} \Sigma_{op})$ | 0.467      | 0.208          | 2.3       |
| 3. Calc. $\mu_{p|o} + \text{chol}(\Sigma_{p|o}) \times Z$                  | 0.049      | 0.052          | 0.9       |
| 4. Calc. Allele Prob                                                       | 0.129      | 0.127          | 1.0       |
| Total                                                                      | 1.732      | 0.465          | 3.7       |

\normalsize

Total run time for 1000 posterior predictive draws:

* CPU (28.9 min)
* CPU+GPU (7.8 min)

## Cholesky CPU vs GPU (P100)

```{r echo=FALSE}
decomp %>%
  tidyr::gather(comp, time, cpu:gpu) %>%
  ggplot(aes(y=time, x=n, color=method, linetype=comp)) +
    geom_line() +
    geom_point() +
    ylab("time (secs)")
```

##

```{r echo=FALSE}
decomp %>%
  tidyr::gather(comp, time, cpu:gpu) %>%
  ggplot(aes(y=time, x=n, color=method, linetype=comp)) +
    geom_line() +
    geom_point() +
    ylab("time (secs)") +
    scale_y_log10()
```

## Relative Performance

```{r echo=FALSE}
decomp %>%
  ggplot(aes(y=cpu/gpu, x=n, color=method)) +
    geom_line() +
    geom_point() +
    ylab("Relative performance") +
    scale_y_log10()
```

## Aside (1) - Matrix Multiplication

```{r echo=FALSE, message=FALSE}
mat_mult = readr::read_csv("data/mat_mult.csv")

mat_mult %>%
  tidyr::gather(comp, time, -n) %>%
  ggplot(aes(x=n, y=time/1000, linetype=comp)) +
    geom_line() +
    geom_point() +
    ylab("time (sec)") +
    labs(title="Matrix Multiplication")
```

##

```{r echo=FALSE, message=FALSE}
mat_mult %>%
  ggplot(aes(x=n, y=cpu/gpu)) +
    geom_line() +
    geom_point() +
    ylab("Relative performance") +
    labs(title="Matrix Multiplication - Relative Performance")
```

## Aside (2) - Memory Limitations

A general covariance is a dense $n \times n$ matrix, meaning it will require $n^2 \times$ 64 bits to store.

```{r echo=FALSE}
size = data_frame(
  n = seq(1000, 50000, by=1000)
) %>%
  mutate(size_gb = n^2 * 64/8 / 10^9)

ggplot(size, aes(x=n, y=size_gb)) +
  geom_line() +
  geom_hline(yintercept = c(4,8,16), color="red") +
  labs(y="Cov Matrix Size (GB)")
```

## Other big hammers

\small

`bigGP` is an R package written by Chris Paciorek (UC Berkeley), et al.

* Specialized distributed implementation of linear algebra operations for GPs
* Designed to run on large supercomputer clusters
* Uses both shared and distributed memory
* Able to fit models on the order of $n = 65$k (32 GB Cov. matrix)

\vspace{-3mm}

\begin{center}
\includegraphics[width=0.7\textwidth]{figs/Paciorek.pdf}
\end{center}

## More scalable solutions?

\large

* Spectral domain / basis functions

\vspace{3mm}

* Covariance tapering

\vspace{3mm}

* GMRF approximations

\vspace{3mm}

* Low-rank approximations

\vspace{3mm}

* Nearest-neighbor models

# Low Rank Approximations

## Low rank approximations in general

Let's look at the example of the singular value decomposition of a matrix,
$$ \underset{n \times m}{M} = \underset{n \times n}{U}\,\underset{n \times m}{\text{diag}(S)}\,\underset{m \times m}{V^{\,t}} $$
where $U$ are called the left singular vectors, $V$ the right singular vectors, and $S$ the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

. . .

The Eckart–Young theorem states that we can construct a rank $k$ approximation of $M$ by setting $\tilde S$ to contain only the $k$ largest singular values, with all other singular values set to zero.
$$
\begin{aligned}
\underset{n \times m}{\tilde M} &= \underset{n \times n}{U}\,\underset{n \times m}{\text{diag}(\tilde S)}\,\underset{m \times m}{V^{\,t}} \\
&= \underset{n \times k}{\tilde U}\,\underset{k \times k}{\text{diag}(\tilde S)}\,\underset{k \times m}{\tilde{V}^{\,t}}
\end{aligned}
$$
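## Aside - Truncated SVD in R

Before the worked example on the next slide, here is what this construction looks like in R. The helper below is my own illustration (not code used elsewhere in these slides): it truncates the SVD at rank $k$ and reports the Frobenius error of the approximation.

```{r svd-sketch, eval=FALSE}
# Rank-k approximation of a matrix via the SVD (Eckart-Young)
low_rank = function(M, k) {
  s = svd(M)
  # keep only the k largest singular values / vectors
  s$u[, 1:k, drop=FALSE] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k, drop=FALSE])
}

# Example: a small covariance matrix built from random 2d locations
set.seed(1)
d   = as.matrix(dist(matrix(runif(2 * 50), ncol = 2)))
cov = exp(-3 * d)

M_tilde = low_rank(cov, k = 10)
sqrt(sum((cov - M_tilde)^2))   # Frobenius norm of the approximation error
```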
## Example

```{r echo=FALSE, message=FALSE, eval=FALSE}
print_mat = function(m, digits=3) {
  print(
    xtable::xtable(m, digits=digits),
    floating=FALSE, tabular.environment="pmatrix",
    hline.after=NULL, include.rownames=FALSE, include.colnames=FALSE
  )
}

hilbert = function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }

M = hilbert(4)
s = svd(M)
D = diag(s$d)

svd(M)
print_mat(M)
print_mat(s$u)
print_mat(D)
print_mat(t(s$v))

D_tilde = diag(c(s$d[1:2], 0, 0))
M_tilde = s$u %*% D_tilde %*% t(s$v)
print_mat(M_tilde)
```

\footnotesize
$$
\begin{aligned}
M &= \begin{pmatrix}
1.000 & 0.500 & 0.333 & 0.250 \\
0.500 & 0.333 & 0.250 & 0.200 \\
0.333 & 0.250 & 0.200 & 0.167 \\
0.250 & 0.200 & 0.167 & 0.143 \\
\end{pmatrix}
= U \, \text{diag}(S) \, V^{\,t} \\
U = V &= \begin{pmatrix}
-0.79 &  0.58 & -0.18 & -0.03 \\
-0.45 & -0.37 &  0.74 &  0.33 \\
-0.32 & -0.51 & -0.10 & -0.79 \\
-0.25 & -0.51 & -0.64 &  0.51 \\
\end{pmatrix} \\
S &= \begin{pmatrix} 1.50 & 0.17 & 0.01 & 0.00 \end{pmatrix}
\end{aligned}
$$

. . .

\normalsize
Rank 2 approximation:
\footnotesize
$$
\begin{aligned}
\tilde M &= \begin{pmatrix}
-0.79 &  0.58 \\
-0.45 & -0.37 \\
-0.32 & -0.51 \\
-0.25 & -0.51 \\
\end{pmatrix}
\begin{pmatrix}
1.50 & 0.00 \\
0.00 & 0.17 \\
\end{pmatrix}
\begin{pmatrix}
-0.79 & -0.45 & -0.32 & -0.25 \\
 0.58 & -0.37 & -0.51 & -0.51 \\
\end{pmatrix} \\
&= \begin{pmatrix}
1.000 & 0.501 & 0.333 & 0.249 \\
0.501 & 0.330 & 0.251 & 0.203 \\
0.333 & 0.251 & 0.200 & 0.166 \\
0.249 & 0.203 & 0.166 & 0.140 \\
\end{pmatrix}
\end{aligned}
$$

## Approximation Error

We can measure the error of the approximation using the Frobenius norm,
$$ \lVert M-\tilde M\rVert_F = \left( \sum_{i=1}^n\sum_{j=1}^m (M_{ij}-\tilde M_{ij})^2\right)^{1/2} $$

. . .

$$
M-\tilde M =
\begin{pmatrix}
 0.00022 & -0.00090 &  0.00012 &  0.00077 \\
-0.00090 &  0.00372 & -0.00053 & -0.00317 \\
 0.00012 & -0.00053 &  0.00013 &  0.00039 \\
 0.00077 & -0.00317 &  0.00039 &  0.00277 \\
\end{pmatrix}
$$

$$ \lVert M-\tilde M\rVert_F = 0.00674 $$

## Cov Mat - Strong dependence (large eff. range):

```{r echo=FALSE, fig.height=4}
d = runif(2*50) %>% matrix(ncol=2) %>% dist() %>% as.matrix()
cov = exp(-d * 3)

svd_m = svd(cov)
sing_values = svd_m$d

res = data_frame(rank=50:0, frob=NA)
for(i in 1:nrow(res)) {
  res$frob[i] = (cov - svd_m$u %*% diag(svd_m$d) %*% t(svd_m$v))^2 %>% sum() %>% sqrt()
  svd_m$d[50-i+1] = 0
}

par(mfrow=c(1,2))
plot(sing_values, type='b', xlab="", ylab="Singular Values", main="SVD")
plot(res$rank, res$frob, type='b', xlab="Rank", ylab="Error (Frob. norm)", main="Low Rank SVD")
```

## Cov Mat - Weak dependence (short eff. range):

```{r echo=FALSE, fig.height=4}
d = runif(2*50) %>% matrix(ncol=2) %>% dist() %>% as.matrix()
cov = exp(-d * 8)

svd_m = svd(cov)
sing_values = svd_m$d

res = data_frame(rank=50:0, frob=NA)
for(i in 1:nrow(res)) {
  res$frob[i] = (cov - svd_m$u %*% diag(svd_m$d) %*% t(svd_m$v))^2 %>% sum() %>% sqrt()
  svd_m$d[50-i+1] = 0
}

par(mfrow=c(1,2))
plot(sing_values, type='b', xlab="", ylab="Singular Values", main="SVD")
plot(res$rank, res$frob, type='b', xlab="Rank", ylab="Error (Frob. norm)", main="Low Rank SVD")
```
## How does this help? (Sherman-Morrison-Woodbury) {.t}

There is an immensely useful linear algebra identity, the Sherman-Morrison-*Woodbury* formula, for the inverse (and determinant) of a decomposed matrix,
$$
\begin{aligned}
\underset{n \times n}{\tilde M}^{-1}
&= \left(\underset{n \times n}{A} + \underset{n \times k}{U} ~ \underset{k \times k}{S} ~ \underset{k \times n}{V^t}\right)^{-1} \\
&= A^{-1} - A^{-1} U \left(S^{-1}+V^{\,t} A^{-1} U\right)^{-1}V^{\,t} A^{-1}.
\end{aligned}
$$

. . .

How does this help?

* Imagine that $A = \text{diag}(A)$, then it is trivial to find $A^{-1}$.

* $S^{-1}$ is $k \times k$ which is hopefully small, or even better $S = \text{diag}(S)$.

* $\left(S^{-1}+V^{\,t} A^{-1} U\right)$ is $k \times k$ which is also hopefully small.

## Aside - Determinant

Remember for any MVN distribution when evaluating the likelihood
$$ -\frac{1}{2} \log {|\Sigma|} - \frac{1}{2} (\symbf{x}-\symbf{\mu})' {\symbf{\Sigma}^{-1}} (\symbf{x}-\symbf{\mu}) - \frac{n}{2}\log 2\pi$$
we need the inverse of $\Sigma$ as well as its *determinant*.

. . .

* For a full rank Cholesky decomposition we get the determinant for ``free''.

\vspace{-3mm}

$$|M| = |LL^t| = \prod_{i=1}^n \left(\text{diag}(L)_i\right)^2$$

. . .

* For a low rank approximation the Sherman-Morrison-Woodbury determinant lemma gives us,

\vspace{-3mm}

$$
\begin{aligned}
\det(\tilde M) &= \det({A} + {U} {S} {V^t}) \\
               &= \det(S^{-1} + V^t A^{-1} U) ~ \det(S) ~ \det(A)
\end{aligned}
$$

## Low rank approximations for GPs {.t}

For a standard spatial random effects model,
\[ y(\symbf{s}) = x(\symbf{s}) \, \symbf{\beta} + w(\symbf{s}) + \epsilon, \quad \epsilon \sim N(0,~\tau^2 I) \]
\[ w(\symbf{s}) \sim \mathcal{N}(0,~\symbf{\Sigma}(\symbf{s})), \quad \symbf{\Sigma}(\symbf{s},\symbf{s}')=\sigma^2\,\rho(\symbf{s},\symbf{s}'|\theta) \]
the Sherman-Morrison-Woodbury results above make the inverse and determinant cheap if we can replace $\symbf{\Sigma}(\symbf{s})$ with a low rank approximation of the form

* $\symbf{\Sigma}(\symbf{s}) \approx \symbf{U}\,\symbf{S}\,\symbf{V}^t$ where
    * $\symbf{U}$ and $\symbf{V}$ are $n \times k$,
    * $\symbf{S}$ is $k \times k$, and
    * $A = \tau^2 I$ or a similar diagonal matrix.
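## Aside - Checking Woodbury in R

As a sanity check, the Woodbury identity and the determinant lemma can be verified numerically for a small diagonal-plus-low-rank matrix. This is a standalone sketch with made-up sizes, not part of the model code that follows.

```{r smw-sketch, eval=FALSE}
set.seed(1)
n = 1000; k = 20

A = diag(runif(n, 1, 2))             # diagonal "nugget" part, trivial to invert
U = matrix(rnorm(n * k), n, k)
S = diag(runif(k, 1, 2))
V = U                                # symmetric low rank part

M = A + U %*% S %*% t(V)

# Woodbury: only a k x k system needs to be solved
A_inv  = diag(1 / diag(A))
middle = solve(solve(S) + t(V) %*% A_inv %*% U)
M_inv  = A_inv - A_inv %*% U %*% middle %*% t(V) %*% A_inv

max(abs(M_inv - solve(M)))           # agrees with the direct inverse (up to rounding)

# Determinant lemma (log scale to avoid overflow)
ld1 = determinant(M, logarithm = TRUE)$modulus
ld2 = determinant(solve(S) + t(V) %*% A_inv %*% U, logarithm = TRUE)$modulus +
        determinant(S, logarithm = TRUE)$modulus + sum(log(diag(A)))
c(ld1, ld2)                          # should match
```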
# Predictive Processes

## Gaussian Predictive Processes

\small

For a rank $k$ approximation,

* Pick $k$ knot locations $\symbf{s}^\star$

. . .

* Calculate the knot covariance, $\symbf{\Sigma}(\symbf{s}^\star)$, and knot cross-covariance, $\symbf{\Sigma}(\symbf{s}, \symbf{s}^\star)$

. . .

* Approximate the full covariance using

\vspace{-2mm}

$$ \symbf{\Sigma}(\symbf{s}) \approx \underset{n \times k}{\symbf{\Sigma}(\symbf{s},\symbf{s}^\star)} \, \underset{k \times k}{\symbf{\Sigma}(\symbf{s}^\star)^{-1}} \, \underset{k \times n}{\symbf{\Sigma}(\symbf{s}^\star,\symbf{s})}. $$

. . .

* The predictive process systematically underestimates the variance ($\sigma^2$) and inflates $\tau^2$; the modified predictive process corrects this using

\vspace{-2mm}

$$
\begin{aligned}
\symbf{\Sigma}(\symbf{s}) \approx & \symbf{\Sigma}(\symbf{s},\symbf{s}^\star) \, \symbf{\Sigma}(\symbf{s}^\star)^{-1} \, \symbf{\Sigma}(\symbf{s}^\star,\symbf{s}) \\
&+ \text{diag}\Big(\symbf{\Sigma}(\symbf{s}) - \symbf{\Sigma}(\symbf{s},\symbf{s}^\star) \, \symbf{\Sigma}(\symbf{s}^\star)^{-1} \, \symbf{\Sigma}(\symbf{s}^\star,\symbf{s})\Big).
\end{aligned}
$$

\vspace{4mm}

\footnotesize
\begin{center}
Banerjee, Gelfand, Finley, Sang (2008) \quad Finley, Sang, Banerjee, Gelfand (2008)
\end{center}

## Example

Below we have a surface generated from a squared exponential Gaussian process where
$$ \{\Sigma\}_{ij} = \sigma^2 \exp\left(-(\phi\,d_{ij})^2\right) + \tau^2 \, 1_{i=j} $$
$$ \sigma^2 = 1 \quad \phi=9 \quad \tau^2 = 0.1 $$

```{r echo=FALSE}
set.seed(20170410)

if(!file.exists("pp_data.Rdata")) {
  n = 4900
  n_samp = 1000

  r = raster::raster(xmn=0, xmx=1, ymn=0, ymx=1, nrow=sqrt(n), ncol=sqrt(n))
  coords = raster::xyFromCell(r, 1:length(r))

  cov_func = function(d) exp(-(9*d)^2) + ifelse(d==0, 0.1, 0)

  Sigma = coords %>% dist() %>% as.matrix() %>% cov_func()
  r[] = t(chol(Sigma)) %*% rnorm(n)

  obs_coords = runif(2*n_samp) %>% matrix(ncol=2)

  Sigma_2  = obs_coords %>% fields::rdist() %>% cov_func()
  Sigma_21 = fields::rdist(obs_coords, coords) %>% cov_func()

  obs = Sigma_21 %*% solve(Sigma) %*% r[] +
        t(chol(Sigma_2 - Sigma_21 %*% solve(Sigma, t(Sigma_21)))) %*% rnorm(n_samp)

  d = data_frame(z=c(obs), x=obs_coords[,1], y=obs_coords[,2])

  r_obs = r
  r_obs[] = NA
  r_obs[raster::cellFromXY(r_obs, d[,c("x","y")] %>% as.matrix())] = d$z

  save(d, r, coords, r_obs, cov_func, n, n_samp, file = "pp_data.Rdata")
} else {
  load("pp_data.Rdata")
}
```

```{r echo=FALSE, fig.height=4.5}
par(mfrow=c(1,2))
plot(r, main="True Surface", axes=FALSE, asp=0)
plot(r_obs, main="Observed Data", axes=FALSE, asp=0)
```
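## Aside - Building a Predictive Process Covariance in R

Before handing things off to `spBayes`, it can help to see the approximation itself. This is an illustrative sketch (the knot grid, covariance function, and object names are my own, and it assumes the `fields` package for `rdist()`, as in the data-generation chunk above); it is not what `spLM` does internally.

```{r pp-sketch, eval=FALSE}
set.seed(1)
n = 500; k = 25

loc   = matrix(runif(2 * n), ncol = 2)                        # observation locations
knots = expand.grid(x = seq(0.1, 0.9, length.out = 5),        # 5 x 5 knot grid
                    y = seq(0.1, 0.9, length.out = 5)) %>% as.matrix()

cov_func = function(d) exp(-(9 * d)^2)                        # same kernel as above, no nugget

Sigma      = cov_func(fields::rdist(loc))                     # full n x n covariance
Sigma_star = cov_func(fields::rdist(knots)) + diag(1e-8, k)   # k x k knot covariance
Sigma_s    = cov_func(fields::rdist(loc, knots))              # n x k cross-covariance

# Predictive process approximation: Sigma(s, s*) Sigma(s*)^-1 Sigma(s*, s)
Sigma_pp = Sigma_s %*% solve(Sigma_star, t(Sigma_s))

# Modified PP: add back the lost diagonal variance
Sigma_mpp = Sigma_pp + diag(diag(Sigma - Sigma_pp))

# The PP underestimates the marginal variance; the modified PP restores it
c(mean(diag(Sigma)), mean(diag(Sigma_pp)), mean(diag(Sigma_mpp)))
```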
## Predictive Process Model Results

```{r echo=FALSE, message=FALSE}
library(spBayes)

if (!file.exists("pp_models.Rdata")) {
  n.samples = 20000
  starting = list("phi"=3/0.3, "sigma.sq"=1, "tau.sq"=0.1)
  tuning   = list("phi"=0.1, "sigma.sq"=0.1, "tau.sq"=0.1)
  priors   = list("beta.Norm"=list(0, 100), "phi.Unif"=c(3/1, 3/0.1),
                  "sigma.sq.IG"=c(2, 2), "tau.sq.IG"=c(2, 2))
  cov.model = "gaussian"

  m = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
           starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
           n.samples=n.samples, verbose=TRUE, n.report=n.samples/2+1)

  pp_5 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
              knots=c(5,5,0.1), modified.pp = FALSE,
              starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
              n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  pp_10 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
               knots=c(10,10,0.05), modified.pp = FALSE,
               starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
               n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  pp_15 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
               knots=c(15,15,0.05), modified.pp = FALSE,
               starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
               n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  mpp_5 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
               knots=c(5,5,0.1), modified.pp = TRUE,
               starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
               n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  mpp_10 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
                knots=c(10,10,0.05), modified.pp = TRUE,
                starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
                n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  mpp_15 = spLM(z~1, data=d, coords=d[,c("x","y")] %>% as.matrix(),
                knots=c(15,15,0.05), modified.pp = TRUE,
                starting=starting, tuning=tuning, priors=priors, cov.model=cov.model,
                n.samples=n.samples, verbose=FALSE, n.report=n.samples/2+1)

  models = list(m=m, pp_5=pp_5, pp_10=pp_10, pp_15=pp_15,
                mpp_5=mpp_5, mpp_10=mpp_10, mpp_15=mpp_15)
  save(models, file="pp_models.Rdata")
} else {
  load("pp_models.Rdata")
}
```

```{r echo=FALSE}
if (!file.exists("pp_predict.Rdata")) {
  predictions = lapply(
    models,
    function(m) {
      pred = spPredict(m, coords, matrix(1, nrow=nrow(coords)),
                       start = 10001, thin=100, n.report=10)

      rast = r
      rast[] = pred$p.y.predictive.samples %>% t() %>% post_summary() %>% .$post_mean
      rast
    }
  ) %>% setNames(names(models))

  save(predictions, file="pp_predict.Rdata")
} else {
  load("pp_predict.Rdata")
}
```

```{r echo=FALSE}
par(mfrow=c(2,4), mar=c(1,1,3,1))
plot(r, axes=FALSE, asp=0, legend=FALSE, main="True Field")
plot(predictions$pp_5, axes=FALSE, asp=0, legend=FALSE, main="PP - 5 x 5 knots")
plot(predictions$pp_10, axes=FALSE, asp=0, legend=FALSE, main="PP - 10 x 10 knots")
plot(predictions$pp_15, axes=FALSE, asp=0, legend=FALSE, main="PP - 15 x 15 knots")
plot(predictions$m, axes=FALSE, asp=0, legend=FALSE, main="Full GP")
plot(predictions$mpp_5, axes=FALSE, asp=0, legend=FALSE, main="Mod. PP - 5 x 5 knots")
plot(predictions$mpp_10, axes=FALSE, asp=0, legend=FALSE, main="Mod. PP - 10 x 10 knots")
plot(predictions$mpp_15, axes=FALSE, asp=0, legend=FALSE, main="Mod. PP - 15 x 15 knots")
```

## Performance

```{r echo=FALSE}
res = data_frame(
  time  = purrr::map_dbl(models, ~ .$run.time[3]),
  error = purrr::map_dbl(predictions, ~ (.[] - r[])^2 %>% sum() %>% sqrt()),
  model = c("Full GP", "PP", "PP", "PP", "Mod. PP", "Mod. PP", "Mod. PP") %>% forcats::as_factor(),
  knots = c("-", 25, 100, 225, 25, 100, 225) %>% forcats::as_factor()
)

ggplot(res, aes(x=time, y=error, color=knots, shape=model)) +
  geom_point(size=3)
```

## Parameter Estimates

```{r echo=FALSE, fig.height=3}
post_mean = purrr::map(models, ~ .$p.theta.samples %>% post_summary() %>%
                                   select(param, post_mean) %>% tibble::deframe()) %>%
  purrr::transpose() %>%
  purrr::simplify_all() %>%
  bind_cols() %>%
  cbind(model = names(models), .) %>%
  mutate(model = as.character(model)) %>%
  rbind(list("true", 1, 0.1, 9)) %>%
  mutate(
    model = c("Full GP", "PP", "PP", "PP", "Mod. PP", "Mod. PP", "Mod. PP", "True") %>% forcats::as_factor(),
    knots = c("-", 25, 100, 225, 25, 100, 225, "-") %>% forcats::as_factor()
  ) %>%
  tidyr::gather(param, value, -model, -knots)

ggplot(post_mean, aes(x=value, y=as.integer(forcats::as_factor(model)), col=knots, shape=model)) +
  geom_point(size=3) +
  facet_wrap(~param, scale="free_x") +
  labs(x="Parameter Value", y="") +
  theme(axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank())
```

# Random Projections

## Low Rank Approximations via Random Projections

1. Starting with a matrix $\underset{m \times n}{\symbf{A}}$.

. . .

2. Draw a Gaussian random matrix $\underset{n \times k+p}{\symbf{\Omega}}$.

. . .

3. Form $\symbf{Y} = \symbf{A}\,\symbf{\Omega}$ and compute its QR factorization $\symbf{Y} = \symbf{Q}\,\symbf{R}$.

. . .

4. Form $\symbf{B}=\symbf{Q}'\,\symbf{A}$.

. . .

5. Compute the SVD of $\symbf{B} = \symbf{\hat{U}}\,\symbf{S}\,\symbf{V}'$.

. . .

6. Form the matrix $\symbf{U} = \symbf{Q} \, \symbf{\hat{U}}$.

. . .

7. Form $\symbf{\tilde{A}} = \symbf{U}\symbf{S}\symbf{V}'$ (see the code sketch after this slide).

\vspace{2mm}

The resulting approximation has a bounded expected error,
\[ E\| \symbf{A} - \symbf{U}\symbf{S}\symbf{V}'\|_F \leq \left[1 + \frac{4\sqrt{k+p}}{p-1} \sqrt{\min(m,n)} \right] \sigma_{k+1}. \]

\vvfill

\footnotesize
\begin{center}
Halko, Martinsson, Tropp (2011)
\end{center}
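## Aside - Randomized SVD in R

The algorithm above is only a few lines of linear algebra. Here is a minimal base-R sketch, written for illustration; it is not the implementation used for the timing comparisons later in the lecture.

```{r rsvd-sketch, eval=FALSE}
rand_svd = function(A, k, p = 10) {
  n = ncol(A)
  Omega = matrix(rnorm(n * (k + p)), n, k + p)  # Gaussian random test matrix
  Q = qr.Q(qr(A %*% Omega))                     # orthonormal basis for the range of A Omega
  B = t(Q) %*% A                                # small (k+p) x n matrix
  s = svd(B)
  list(u = Q %*% s$u, d = s$d, v = s$v)         # U = Q Uhat, so A ~ U diag(d) V'
}

# Example on a small covariance matrix
set.seed(1)
d   = as.matrix(dist(matrix(runif(2 * 500), ncol = 2)))
A   = exp(-3 * d)
out = rand_svd(A, k = 25)

A_tilde = out$u %*% diag(out$d) %*% t(out$v)
sqrt(sum((A - A_tilde)^2))                      # Frobenius error of the approximation
```

The expensive pieces are the large multiplications $\symbf{A}\,\symbf{\Omega}$ and $\symbf{Q}'\,\symbf{A}$; the QR and SVD only involve $(k+p)$-sized matrices.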
## Random Matrix Low Rank Approximations and GPs

The preceding algorithm can be modified slightly to take advantage of the positive definite structure of a covariance matrix.

1. Starting with an $n \times n$ covariance matrix $\symbf{A}$.

. . .

2. Draw a Gaussian random matrix $\underset{n \times k+p}{\symbf{\Omega}}$.

. . .

3. Form $\symbf{Y} = \symbf{A}\,\symbf{\Omega}$ and compute its QR factorization $\symbf{Y} = \symbf{Q}\,\symbf{R}$.

. . .

4. Form $\symbf{B}=\symbf{Q}'\,\symbf{A}\,\symbf{Q}$.

. . .

5. Compute the eigen decomposition of $\symbf{B} = \symbf{\hat{U}}\,\symbf{S}\,\symbf{\hat{U}}'$.

. . .

6. Form the matrix $\symbf{U} = \symbf{Q} \, \symbf{\hat{U}}$.

. . .

Once again we have a bound on the error,
\[ E \|\symbf{A} - \symbf{U}\symbf{S}\symbf{U}'\|_F \lesssim c \cdot \sigma_{k+1}. \]

\vvfill

\footnotesize
\begin{center}
Halko, Martinsson, Tropp (2011), Banerjee, Dunson, Tokdar (2012)
\end{center}

## Low Rank Approximations and GPUs

Both predictive process and random matrix low rank approximations are good candidates for acceleration using GPUs.

\vspace{3mm}

* Both use Sherman-Morrison-Woodbury to calculate the inverse (involves matrix multiplication, addition, and a small matrix inverse).

\vspace{3mm}

* Predictive processes involve several covariance matrix calculations (knots and cross-covariance) and a small matrix inverse.

\vspace{3mm}

* Random matrix low rank approximations involve a large matrix multiplication ($\symbf{A}\,\symbf{\Omega}$) and several small matrix decompositions (QR, eigen).

## Comparison ($n=15,000$, $k=\{100,\ldots,4900\}$)

```{r fig.height=4, echo=FALSE}
load("data/res3.Rdata")
strong_cpu = res$time[res$method=="cpu"]
strong_gpu = res$time[res$method=="gpu"]
strong = res %>%
  filter(!method %in% c("cpu","gpu")) %>%
  filter(!method %in% c("lr2", "lr2 mod", "lr3", "lr3 mod"))

load("data/res12.Rdata")
weak_cpu = res$time[res$method=="cpu"]
weak_gpu = res$time[res$method=="gpu"]
weak = res %>%
  filter(!method %in% c("cpu","gpu")) %>%
  filter(!method %in% c("lr2", "lr2 mod", "lr3", "lr3 mod"))

(
  ggplot(strong, aes(x=time, y=error, color=method)) +
    geom_line() +
    geom_point() +
    geom_vline(xintercept = strong_cpu, color="red") +
    geom_vline(xintercept = strong_gpu, color="orange") +
    labs(title="Strong Dependence")
) + (
  ggplot(weak, aes(x=time, y=error, color=method)) +
    geom_line() +
    geom_point() +
    geom_vline(xintercept = weak_cpu, color="red") +
    geom_vline(xintercept = weak_gpu, color="orange") +
    labs(title="Weak Dependence")
)
```

## Rand. Projection LR Decompositions for Prediction {.t}

\small

This approach can also be used for prediction. If we want to sample
$$\symbf{y} \sim \mathcal{N}(0,\symbf{\Sigma})$$
and we have the approximation
$$ \Sigma \approx \symbf{U} \symbf{S} \symbf{U}^t = (\symbf{U} \symbf{S}^{1/2} \symbf{U}^t)(\symbf{U} \symbf{S}^{1/2} \symbf{U}^t)^t, $$
then
$$ y_{\text{pred}} = (\symbf{U}\, \symbf{S}^{1/2}\,\symbf{U}^t) \times \symbf{Z} \text{ where } Z_i \sim \mathcal{N}(0,1) $$
because $\symbf{U}^t \, \symbf{U} = I$, since the columns of $\symbf{U}$ are orthonormal.

\vvfill

\begin{center}
\footnotesize
Dehdari, Deutsch (2012)
\end{center}

##

\vspace{5mm}

\begin{center}
\includegraphics[width=\textwidth]{figs/RandLRPred.png}
\end{center}

$$ n=1000, \quad p=10000 $$
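## Aside - Sampling from a Low Rank Approximation in R

Returning to the sampling identity two slides back, here is a self-contained sketch that builds the positive definite randomized eigendecomposition and draws one approximate sample. The matrix, rank, and object names are invented for illustration.

```{r lr-sample-sketch, eval=FALSE}
set.seed(1)

# A small covariance matrix to stand in for Sigma
dists = as.matrix(dist(matrix(runif(2 * 500), ncol = 2)))
Sigma = exp(-3 * dists)
n     = nrow(Sigma)

# Randomized low rank eigendecomposition (the positive definite variant above)
k = 25; p = 10
Omega = matrix(rnorm(n * (k + p)), n, k + p)
Q     = qr.Q(qr(Sigma %*% Omega))
e     = eigen(t(Q) %*% Sigma %*% Q, symmetric = TRUE)
U     = Q %*% e$vectors                 # n x (k+p), orthonormal columns
S     = pmax(e$values, 0)               # guard against tiny negative eigenvalues

# Approximation quality
Sigma_tilde = U %*% (S * t(U))          # U diag(S) U'
sqrt(sum((Sigma - Sigma_tilde)^2))      # Frobenius error

# One approximate draw: y = U S^{1/2} U' Z, which has covariance U diag(S) U'
Z      = rnorm(n)
y_pred = U %*% (sqrt(S) * (t(U) %*% Z))
```

The draw only requires matrix-vector products with the $n \times (k+p)$ factor, so no $n \times n$ Cholesky is ever formed.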