class: center, middle, inverse, title-slide # Lecture 8: Metropolis Algorithms and Diagnostics ### Merlise Clyde ### September 21 --- ## Last Class: Normal Means Model Data Model `$$Y_i \mid \mu_i, \sigma^2 \overset{iid}{\sim} \textsf(\mu_i, \sigma^2)$$` Means Model `$$\mu_i \mid \mu, \sigma^2_\mu \overset{iid}{\sim} \textsf(\mu, \sigma^2_{\mu})$$` Found marginal likelihood `\(\cal{L}(\mu, \sigma^2, \sigma^2_\mu)\)` by integrating out `\(\mu_i\)` with respect to `\(g\)` `$$\cal{L}(\mu, \sigma^2, \sigma^2_\mu) \propto \prod_{i = 1}^n (\sigma^2 + \sigma^2_\mu)^{-1/2} \exp \left\{ - \frac{1}{2} \frac{\left(y_i - \mu \right)^2}{\sigma^2 + \sigma^2_\mu }\right\}$$` Posterior for `\(\theta = \mu, \sigma^2_\mu\)` with `\(\sigma^2 = 1\)` `$$\pi(\theta \mid y) = \frac{\pi(\theta) \cal{L}(\theta)} {\int_\Theta \pi(\theta) \cal{L}(\theta) \, d\theta} = \frac{\pi(\theta) \cal{L}(\theta)} {m(y)}$$` --- ## Stochastic Sampling Intuition - From a sampling perspective, we need to have a large sample or group of values, `\(\theta^{(1)}, \ldots, \theta^{(S)}\)` from `\(\pi(\theta \mid y)\)` whose empirical distribution approximates `\(\pi(\theta \mid y)\)`. -- - for any two sets `\(A\)` and `\(B\)`, we want .block[ .small[ `$$\frac{\dfrac{\# \theta^{(s)} \in A}{S}}{\dfrac{\# \theta^{(s)} \in B}{S} } = \dfrac{\# \theta^{(s)} \in A}{\# \theta^{(s)} \in B} \approx \dfrac{\pi(\theta \in A \mid y)}{\pi(\theta \in B \mid y)}$$` ] ] -- - Suppose we have a working group `\(\theta^{(1)}, \ldots, \theta^{(s)}\)` at iteration `\(s\)`, and need to add a new value `\(\theta^{(s+1)}\)`. -- - Consider a candidate value `\(\theta^\star\)` that is _close_ to `\(\theta^{(s)}\)`. -- - Should we set `\(\theta^{(s+1)} = \theta^\star\)` or not? --- ## Metropolis Ratio look at the ratio $$ `\begin{split} M & = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)} = \frac{\dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y)} } {\dfrac{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})}{p(y)}}\\ \\ & = \dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})} \end{split}` $$ -- - does not depend on the marginal likelihood we don't know! --- ## Metropolis algorithm - If `\(M > 1\)` + Intuition: `\(\theta^{(s)}\)` is already a part of the density we desire and the density at `\(\theta^\star\)` is even higher than the density at `\(\theta^{(s)}\)`. + Action: set `\(\theta^{(s+1)} = \theta^\star\)` -- - If `\(M < 1\)`, + Intuition: relative frequency of values in our group `\(\theta^{(1)}, \ldots, \theta^{(s)}\)` "equal" to `\(\theta^\star\)` should be `\(\approx M = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)}\)`. -- + For every `\(\theta^{(s)}\)`, include only a fraction of an instance of `\(\theta^\star\)`. -- + Action: set `\(\theta^{(s+1)} = \theta^\star\)` with probability `\(M\)` and `\(\theta^{(s+1)} = \theta^{(s)}\)` with probability `\(1-M\)`. --- ## Proposal Distribution - Where should the proposed value `\(\theta^\star\)` come from? -- - Sample `\(\theta^\star\)` close to the current value `\(\theta^{(s)}\)` using a **symmetric proposal distribution** `\(q(\theta^\star \mid \theta^{(s)})\)` -- - `\(q()\)` is actually a "family of proposal distributions", indexed by the specific value of `\(\theta^{(s)}\)`. -- - Here, symmetric means that `\(q(\theta^\star \mid \theta^{(s)}) = q(\theta^{(s)} \mid \theta^\star)\)`. -- - Common choice `$$\textsf{N}(\theta^\star; \theta^{(s)}, \delta^2 \Sigma)$$` with `\(\Sigma\)` based on the approximate `\(\textsf{Cov}(\theta \mid y)\)` and `\(\delta = 2.44/\text{dim}(\theta)\)` or tune -- `$$\text{Unif}(\theta^\star; \theta^{(s)} - \delta, \theta^{(s)} + \delta)$$` --- ## Metropolis Algorithm Recap - The algorithm proceeds as follows: 1. Given `\(\theta^{(1)}, \ldots, \theta^{(s)}\)`, generate a _candidate_ value `\(\theta^\star \sim q(\theta^\star \mid \theta^{(s)})\)`. -- 2. Compute the acceptance ratio .block[ .small[ $$ `\begin{split} M & = \dfrac{\pi(\theta^\star \mid y)}{\pi(\theta^{(s)} \mid y)} = \dfrac{p(y \mid \theta^\star) \pi(\theta^\star)}{p(y \mid \theta^{(s)}) \pi(\theta^{(s)})}. \end{split}` $$ ] ] -- 3. Set .block[ .small[ `\begin{eqnarray*} \theta^{(s+1)} = \left\{ \begin{array}{ll} \theta^\star & \quad \text{with probability} \quad \text{min}(M,1) \\ \theta^{(s)} & \quad \text{with probability} \quad 1 - \text{min}(M,1) \\ \end{array} \right. \end{eqnarray*}` ] ] equivalent to sampling `\(u \sim U(0,1)\)` independently and setting .block[ .small[ `\begin{eqnarray*} \theta^{(s+1)} = \left\{ \begin{array}{ll} \theta^\star & \quad \text{if} \quad u < M \\ \theta^{(s)} & \quad \text{if} \quad \text{otherwise} \\ \end{array} \right. . \end{eqnarray*}` ] ] --- ## Metropolis algorithm - Once we obtain the samples, then we are back to using Monte Carlo approximations for quantities of interest! -- - we can approximate posterior means, quantiles, and other quantities of interest using the empirical distribution of our sampled values. -- - easy to compute the posterior distribution of nonlinear functions of parameters! `$$\psi^{(s)} = g(\theta^{(s)})$$` -- - some posterior summaries are hard to calculate based on samples `\(\{ \theta^{(s)}\}\)` -- + mode/MAP (at least for continuous) -- + marginal likelihood `\(m(y) = \int \pi(\theta) p(y \mid \theta)\, d\theta\)` --- ## Notes - The Metropolis chain ALWAYS moves to the proposed `\(\theta^\star\)` at iteration `\(s+1\)` if `\(\theta^\star\)` has higher target density than the current `\(\theta^{(s)}\)`. -- - Sometimes, it also moves to a `\(\theta^\star\)` value with lower density in proportion to the density value itself. -- - This leads to a random, Markov process that naturally explores the space according to the probability defined by `\(\pi(\theta \mid y)\)`, and hence generates a sequence that, while dependent, eventually represents draws from `\(\pi(\theta \mid y)\)` (stationary distribution of the Markov Chain). --- ## Convergence We will not cover the convergence theory behind Metropolis chains in detail, but ... -- - The Markov process generated under this procedure is **ergodic** (irreducible and aperiodic) and has a unique limiting distribution (stationary distribution) -- + ergodicity means that the chain can move anywhere at each step, which is ensured, if `\(q(\theta^\star \mid \theta^{(s)}) > 0\)` everywhere! -- - By construction, Metropolis chains are **reversible**, so that `\(\pi(\theta \mid y)\)` is the stationary distribution -- + Think of reversibility as being equivalent to symmetry of the joint density of two consecutive `\(\theta^{(s)}\)` and `\(\theta^{(s+1)}\)` in the stationary process (which we get by using a symmetric proposal distribution) + detailed balance --- ## Example Priors with `\(\sigma^2 = 1\)`: `$$p(\mu) \propto 1$$` -- - Use a `\(\textsf{Cauchy}(0,1)\)` prior on `\(\sigma_\mu\)` independent of `\(\mu\)` and -- - Symmetric proposal for `\(\mu\)` and `\(\sigma_\tau\)`? -- - Try independent normals `\(\frac{2.44^2}{d} \textsf{Cov}(\theta)\)` where `\(d\)` is the dimension of `\(\theta\)` (d = 2) --- ## First 200 Samples <img src="08-metropolis-diagnostics_files/figure-html/joint-1.png" width="75%" style="display: block; margin: auto;" /> - Overall Acceptance probability is 0.6 out of 10^{4} samples -- - Goal is around 0.44 in 1 dimension to 0.23 in higher dimensions --- ## Tuning - Sampled values are correlated -- - Correlation between samples can be adjusted by selecting an optimal `\(\delta\)` (i.e., spread of the distribution) in the proposal distribution -- - `\(\delta\)` too small leads to `\(M \approx 1\)` for most proposed values, a high acceptance rate, but very small moves, leading to highly correlated chain. -- - `\(\delta\)` too large can get "stuck" because `\(\theta^\star\)` may be very far away from high density regions, leading to a very low acceptance rate and again high correlation in the Markov chain. -- - Burn-in and thinning can help! --- ## Burn-in - Convergence occurs regardless of our starting point (in theory), so we can usually pick any reasonable values in the parameter space as a starting point. -- - May take a long time to reach high density regions -- - Over representation of low density samples given finite iterations -- - Generally, we throw out a certain number of the first draws, known as the **burn-in**, as an attempt to make our draws closer to the stationary distribution and less dependent on any single set of starting values. -- - However, we don't know exactly when convergence occurs, so it is not always clear how much burn-in we would need. -- - If you run long enough you should not need to discard any samples! (ergodicity) --- ## Convergence diagnostics - Diagnostics available to help decide on number of burn-in & collected samples. -- - **Note**: no definitive tests of convergence but you should do as many diagnostics as you can, on all parameters in your model. -- - With "experience", visual inspection of trace plots perhaps most useful approach. -- - There are a number of useful automated tests in R. -- - **CAUTION**: diagnostics cannot guarantee that a chain has converged, but they can indicate it has not converged. --- ## Diagnostics in R - The most popular package for MCMC diagnostics in R is `coda`. -- - `coda` uses a special MCMC format so you must always convert your posterior matrix into an MCMC object. -- - For the example, we have the following in R. ```r #library(coda) theta.mcmc <- mcmc(theta,start=1) #no burn-in (simple problem!) ``` --- ## Diagnostics in R ```r summary(theta.mcmc) ``` ``` ## ## Iterations = 1:10000 ## Thinning interval = 1 ## Number of chains = 1 ## Sample size per chain = 10000 ## ## 1. Empirical mean and standard deviation for each variable, ## plus standard error of the mean: ## ## Mean SD Naive SE Time-series SE ## mu -0.07977 0.1046 0.001046 0.002839 ## sigma_mu 0.17550 0.1273 0.001273 0.004397 ## ## 2. Quantiles for each variable: ## ## 2.5% 25% 50% 75% 97.5% ## mu -0.283420 -0.1508 -0.08193 -0.00848 0.1337 ## sigma_mu 0.007995 0.0758 0.15024 0.25228 0.4693 ``` The naive SE is the **standard error of the mean**, which captures simulation error of the mean rather than the posterior uncertainty. The time-series SE adjusts the naive SE for **autocorrelation**. --- ## Effective sample size - The **effective sample size** translates the number of MCMC samples `\(S\)` into an equivalent number of independent samples. -- - It is defined as .block[ .small[ `$$\textrm{ESS} = \dfrac{S}{1 + 2 \sum_k \rho_k},$$` ] ] where `\(S\)` is the sample size and `\(\rho_k\)` is the lag `\(k\)` autocorrelation. -- - For our data, we have ```r effectiveSize(theta.mcmc) ``` ``` ## mu sigma_mu ## 1356.6495 838.2613 ``` -- - So our 10,000 samples are equivalent to 1356.6 independent samples for `\(\mu\)` and 838.3 independent samples for `\(\sigma_\mu\)`. --- ## Trace plot for mean ```r plot(theta.mcmc[,"mu"]) ``` <img src="08-metropolis-diagnostics_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Trace plot for `\(\sigma_\mu\)` ```r plot(theta.mcmc[,"sigma_mu"]) ``` <img src="08-metropolis-diagnostics_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> OK (be careful of scaling in plots!) --- ## Autocorrelation - Another way to evaluate convergence is to look at the autocorrelation between draws of our Markov chain. -- - The lag `\(k\)` autocorrelation, `\(\rho_k\)`, is the correlation between each draw and its `\(k\)`th lag, defined as .block[ .small[ `$$\rho_k = \dfrac{\sum_{s=1}^{S-k}(\theta_s - \bar{\theta})(\theta_{s+k} - \bar{\theta})}{\sum_{s=1}^{S-k}(\theta_s - \bar{\theta})^2}.$$` ] ] -- - We expect the autocorrelation to decrease as `\(k\)` increases. -- - If autocorrelation remains high as `\(k\)` increases, we have slow mixing due to the inability of the sampler to move around the space well. --- ## Autocorrelation for mean ```r autocorr.plot(theta.mcmc[,"mu"]) ``` <img src="08-metropolis-diagnostics_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> So-So --- ## Autocorrelation for variance ```r autocorr.plot(theta.mcmc[,"sigma_mu"]) ``` <img src="08-metropolis-diagnostics_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> worse --- ## Gelman-Rubin Gelman & Rubin suggested a diagnostic `\(R\)` based on taking separate chains with dispersed initial values to test convergence -- <img src="08-metropolis-diagnostics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Gelman-Rubin Diagnostic - Run m > 2 chains of length 2S from overdispersed starting values. - Discard the first S draws in each chain. - Calculate the pooled within-chain variance `\(W\)` and between-chain variance `\(B\)`. -- .block[ .small[ `$$R = \frac{\frac{S-1}{S} W + \frac{1}{S} B }{W}$$` ] ] -- - numerator and denominator are both unbiased estimates of the variance if the two chains have converged -- + otherwise `\(W\)` is an underestimate (hasn't explored enough) + numerator will overestimate as `\(B\)` is too large (overdispersed starting points) -- - As `\(S \to \infty\)` and `\(B \to 0\)`, `\(R \to 1\)` -- - version in `R` is slightly different --- ## Gelman-Rubin Diagnostic ```r theta.mcmc = mcmc.list(mcmc(theta1, start=5000), mcmc(theta2, start=5000)) gelman.diag(theta.mcmc) ``` ``` ## Potential scale reduction factors: ## ## Point est. Upper C.I. ## mu 1 1 ## sigma_mu 1 1 ## ## Multivariate psrf ## ## 1 ``` -- - Values of `\(R > 1.1\)` suggest lack of convergence -- - Looks OK -- See also `gelman.plot` --- ## Geweke statistic - Geweke proposed taking two non-overlapping parts of a single Markov chain (usually the first 10% and the last 50%) and comparing the mean of both parts, using a difference of means test -- - The null hypothesis would be that the two parts of the chain are from the same distribution. -- - The test statistic is a z-score with standard errors adjusted for autocorrelation, and if the p-value is significant for a variable, you need more draws. --- ## Geweke Diagnostic - The output is the z-score itself (not the p-value). ```r geweke.diag(theta.mcmc) ``` ``` ## [[1]] ## ## Fraction in 1st window = 0.1 ## Fraction in 2nd window = 0.5 ## ## mu sigma_mu ## -0.7779 0.7491 ## ## ## [[2]] ## ## Fraction in 1st window = 0.1 ## Fraction in 2nd window = 0.5 ## ## mu sigma_mu ## 0.4454 0.6377 ``` --- ## Practical advice on diagnostics - There are more tests we can use: Raftery and Lewis diagnostic, Heidelberger and Welch, etc. -- - The Gelman-Rubin approach is quite appealing in using multiple chains -- - Geweke (and Heidelberger and Welch) sometimes reject even when the trace plots look good. -- - Overly sensitive to minor departures from stationarity that do not impact inferences. -- - Most common method of assessing convergence is visual examination of trace plots. --- ## Improving - more iterations and multiple chains -- - thinning to reduce correlations and increase ESS -- - change the proposal distribution `\(q\)`