Lecture 9: More MCMC: Adaptive Metropolis, Metropolis-Hastings, and Gibbs

class: center, middle, inverse, title-slide

# Lecture 9: More MCMC: Adaptive Metropolis, Metropolis-Hastings, and Gibbs
### Merlise Clyde
### September 23

---

## Example from Last Class

Priors with `$\sigma^2 = 1$`:

`$$p(\mu) \propto 1$$`
--

- Use a `$\textsf{Cauchy}(0,1)$` prior on `$\sigma_\mu$` independent of `$\mu$` and

- Symmetric proposal for `$\mu$` and  `$\sigma_\tau$`

-  Independent normals centered at current values of `$\mu$` and `$\sigma_\mu$`  with covariance `$\frac{2.4^2}{d} \textsf{Cov}(\theta)$` where `$d = 2$` (the dimension of `$\theta$` )

---
## Gelman-Rubin

Gelman  & Rubin suggested a diagnostic `$R$` based on taking separate  chains with dispersed initial values to test convergence

--
<img src="09-adaptive-metropolis_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
## Gelman-Rubin Diagnostic

- Run m > 2 chains of length 2S from overdispersed starting values.
- Discard the first S draws in each chain.
-  Calculate the pooled within-chain variance `$W$` and between-chain variance `$B$`.

.block[
.small[
`$$R = \frac{\frac{S-1}{S} W + \frac{1}{S} B }{W}$$`
]
]

- numerator and denominator are both unbiased estimates of the variance if the two chains have converged

+  otherwise `$W$` is an underestimate (hasn't explored enough)
  
--
  +  numerator will overestimate as `$B$` is too large (overdispersed starting points)
  
--

- As `$S \to \infty$` and `$B \to 0$`,  `$R \to 1$`

- Note: version in `R` is slightly different

---
## Gelman-Rubin Diagnostic

```r
theta.mcmc = mcmc.list(mcmc(theta1, start=5000), mcmc(theta2, start=5000))
gelman.diag(theta.mcmc)
```

```
## Potential scale reduction factors:
## 
##          Point est. Upper C.I.
## mu                1          1
## sigma_mu          1          1
## 
## Multivariate psrf
## 
## 1
```

-  Values of `$R > 1.1$` suggest lack of convergence

--
- Looks OK
 
--

See also `gelman.plot`

---
## Geweke statistic

- Geweke proposed taking two non-overlapping parts of a single Markov chain (usually the first 10% and the last 50%) and comparing the mean of both parts, using a difference of means test

- The null hypothesis would be that the two parts of the chain are from the same distribution.

- The test statistic is a z-score with standard errors adjusted for autocorrelation, and if the p-value is significant for a variable, you need more draws.

---
## Geweke Diagnostic

- The output is the z-score itself (not the p-value).

```r
geweke.diag(theta.mcmc)
```

```
## [[1]]
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##       mu sigma_mu 
##  -0.7779   0.7491 
## 
## 
## [[2]]
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##       mu sigma_mu 
##   0.4454   0.6377
```

---
## Practical advice on diagnostics

- There are more tests we can use: Raftery and Lewis diagnostic, Heidelberger and Welch, etc.

- The Gelman-Rubin approach is quite appealing in using multiple chains

- Geweke (and Heidelberger and Welch) sometimes reject even when the trace plots look good.

- Overly sensitive to minor departures from stationarity that do not impact inferences.

- Most common method of assessing convergence is visual examination of trace plots.

---
##  Improving

-  more iterations and multiple chains

- thinning to reduce correlations and increase ESS

- change the proposal distribution `$q$`

---
## Proposal Distribution

Common choice 
`$$\textsf{N}(\theta^\star; \theta^{(s)}, \delta^2 \Sigma)$$`
--

-  rough estimate of `$\Sigma$` based on the asymptotic Gaussian approximation `$\textsf{Cov}(\theta \mid y)$` and `$\delta = 2.38/\sqrt{\text{dim}(\theta)}$`

--
  + find the MAP estimate (posterior mode)  `$\hat{\theta}$`
  
--
  + take 
  $$\Sigma =  \left[- 
  \frac{\partial^2 \log(\cal{L}(\theta)) + \log(\pi(\theta))}
       {\partial \theta \partial \theta^T} \right]^{-1}_{\theta = \hat{\theta}}$$`
  
--

- ignore prior and use inverse of Fisher Information (covariance of MLE)
---
### Adaptive Metropolis?

-  MCMC doesn't allow you to use the full history of the chain `$\theta^{(1)}, \ldots, \theta^{(s)}$` in constructing the proposal distributions

- violates the Markov assumption

- Workaround?  run an initial MCMC for an initial tuning phase (e.g. 1000 samples) and then fix the kernel to depend only on `$\theta^{(s-1)}$` and `$y$`.

- more elegant approach -  formal **adaptive Metropolis**

+ keep adapting the entire time!
  
--
  + this may mess up convergence !  
  
--
  + need conditions for vanishing adaptation e.g.  that the proposal depends less and less on recent states in the chain - Roberts & Rosenthal (2006) and other conditions
  
  
---
## Adaptive MCMC

- Haario et al (2001) propose a simple and effective adaptive random walk Metropolis  (RWM)

-  run  RWM with a Gaussian proposal for a fixed number of iterations for `$s < s_0$`

- estimate of covariance at state `$s$`

`$$\Sigma^{(s)} = \frac{1}{s}\left(\sum_{i=1}^s \theta^{(i)} {\theta^{(i)}}^T - 
s \bar{\theta}^{(s)} {\bar{\theta}^{(s)}}^T\right)$$`

- proposal for `$s > s_0$` with `$\delta = 2.38/\sqrt{d}$`

`$$\theta^* \sim \textsf{N}(\theta^{(s)}, \delta^2 (\Sigma^{(s)} + \epsilon I_d))$$`

- `$\epsilon > 0$` insures covariance is positive definite

- if `$s_0$` is too large will take longer for adaptation to be seen

---
## Example again

Acceptance rate now around 30-35 % of 10,000 iterations!
---
##  Metropolis-Hastings  (MH)

- Metropolis requires that the proposal distribution be symmetric

- Hastings (1970)  generalizes Metropolis algorithms to allow asymmetric proposals - aka Metropolis-Hastings or MH  `$q(\theta^* \mid \theta^{(s)})$` does not need to be the same as `$q(\theta^{(s)} \mid \theta^*)$`

--
-  propose `$\theta^*  \mid \theta^{(s)} \sim q(\theta^* \mid \theta^{(s)})$`

--
- Acceptance probability

`$$\min \left\{ 1, \frac{\pi(\theta^*) \cal{L}(\theta^*)/q(\theta^* \mid \theta^{(s)})}
{\pi(\theta^{(s)}) \cal{L}(\theta^{(s)})/q( \theta^{(s)} \mid \theta^*)} \right\}$$`

- adjustment for asymmetry in acceptance ratio is key to ensuring convergence to stationary distribution!

---
## Special cases

- Metropolis

- Independence chain

- Gibbs samplers

- Metropolis-within-Gibbs

- combinations of the above!

---
## Independence Chain

- suppose we have a good approximation `$\tilde{\pi}(\theta \mid y)$`   to  `$\pi(\theta \mid y)$`

-  Draw `$\theta^* \sim \tilde{\pi}(\theta \mid y)$` _without_ conditioning on `$\theta^{(s)}$`

- acceptance probability

`$$\min \left\{ 1, \frac{\pi(\theta^*) \cal{L}(\theta^*)/\tilde{\pi}(\theta^* \mid \theta^{(s)})}
{\pi(\theta^{(s)}) \cal{L}(\theta^{(s)})/\tilde{\pi}( \theta^{(s)} \mid \theta^*)} \right\}$$`

- what happens if the approximation is really accurate?

- probability of acceptance is `$\approx 1$`

- Important caveat for convergence:  tails of the posterior should be at least as heavy as the tails of the posterior  (Tweedie 1994)

-  Replace Gaussian by a Student-t with low degrees of freedom

- transformations of `$\theta$`

---
## Block Updates & Gibbs

So far all algorithms update all of the parameters simultaneously

- convenient to break problems in to `$K$` blocks and update them separately

- `$\theta = (\theta_{[1]}, \ldots, \theta_{[K]}) = (\theta_1, \ldots, \theta_p)$`

At iteration `$s$`, for `$k = 1, \ldots, K$` Cycle thru blocks: (fixed order or random order)
   + propose `$\theta^*_{[k]} \sim q_k(\theta_{[k]} \mid \theta_{[<k]}^{(s)}, \theta_{[>k]}^{(s-1)})$`
   
   + set `$\theta_{[k]}^{(s)} = \theta^*_{[k]}$` with probability
   
`$$\min \left\{ 1, \frac{
   \pi(\theta_{[<k]}^{(s)},\theta_{[k]}^*, 
        \theta_{[>k]}^{(s-1)})
   \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^*,
           \theta_{[>k]}^{(s-1)})/
   q_k(\theta_{[k]}^* \mid \theta_{[<k]}^{(s)},    
       \theta_{[>k]}^{(s-1)})}
  {\pi(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)}, 
        \theta_{[>k]}^{(s-1)})
   \cal{L}(\theta_{[<k]}^{(s)},\theta_{[k]}^{(s-1)},
           \theta_{[>k]}^{(s-1)})/
   q_k(\theta_{[k]}^{(s-1)} \mid \theta_{[<k]}^{(s)},    
       \theta_{[>k]}^{(s-1)})} \right\}$$`
---
## Gibbs Sampler

special case of MH

- proposal distribution `$q_k$` for the `$k$`th block is the **full conditional** distribution for `$\theta_{[k]}$`

`$$\pi(\theta_{[k]} \mid \theta_{[-k]}, y) = \frac{\pi(\theta_{[k]} , \theta_{[-k]} \mid y)}{ \pi(\theta_{[-k]} \mid y))} \propto \pi(\theta_{[k]} , \theta_{[-k]} \mid y)$$`
--

`$$\pi(\theta_{[k]} \mid \theta_{[-k]}, y)  \propto \cal{L}(\theta_{[k]} , \theta_{[-k]})\pi(\theta_{[k]} , \theta_{[-k]})$$`
--

- acceptance probability is always 1!

- even though joint distribution is messy, full conditionals may be (conditionally) conjugate and easy to sample from!

---
## Comments

- can use Gibbs steps and Metropolis Hastings steps together

- Use block sizes in Gibbs that are as big as possible to improve  mixing (proven faster convergence)

- combine with adaptive Metropolis

- Adaptive Independence Metropolis Hastings  (learn a mixture)