class: center, middle, inverse, title-slide

# STA 601: Lecture 3
## The Normal Model & Prior/Posterior Predictive Distributions
### Merlise Clyde
### 9/1/2021

---
## Outline

- Normal Model

--

- Predictive Distributions

--

  - Prior Predictive; useful for prior elicitation

--

  - Posterior Predictive

--

      + Predicting/forecasting future events

--

- Comparing Estimators

---
## Normal Model Setup

- Suppose we have independent observations
`$$\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$$`

--

  where each `\(y_i \sim \textsf{N}(\theta, \sigma^2)\)` (iid)

--

- We will see that it is more convenient to work with the **precision** `\(\tau = 1/\sigma^2\)`

--

- Reparameterizing, the model for the data is
`$$y_i \sim \textsf{N}(\theta, \tau^{-1})$$`

--

- For simplicity we will treat `\(\tau\)` as known initially.

---
## Marginal Distribution

- Recall that the **marginal distribution** is
`$$p(\mathbf{y}) = p(y_1, \ldots, y_n) = \int_\Theta p(y_1, \ldots, y_n \mid \theta)\, \pi(\theta)\, d\theta$$`

--

- this is also called the **prior predictive** distribution and is independent of any unknown parameters

- We may care about making predictions before we even see any data.

--

- This is often useful as a way to check whether the sampling distribution or prior we have chosen is appropriate, after integrating out all unknown parameters.

--

- Need to specify a prior for `\(\theta\)` on `\(\mathbb{R}\)`

---
## Prior for a Normal Mean

- Natural choice is a Normal/Gaussian distribution (Conjugate prior)
`$$\theta \sim \textsf{N}(\theta_0, 1/\tau_0)$$`

--

- `\(\theta_0\)` is the prior mean
  - best guess for `\(\theta\)` using information other than `\(\mathbf{y}\)`

--

- Prior variance `\(\sigma^2_0 = 1/\tau_0\)`

--

- `\(\tau_0\)` is the prior precision and expresses our certainty about this guess

--

- one notion of non-informative is to let `\(\tau_0 \to 0\)`

--

- a better justification is as Jeffreys' prior (uniform measure) **Derive**
`$$\pi(\theta) \propto 1$$`

--

- invariant to reparameterization and to location shifts of the data (group invariance)

---
## Prior Predictive for a Single Case

`$$\begin{split}
p(y) & \propto \int_\mathbb{R} p(y \mid \theta)\, \pi(\theta) \, d\theta \\
 & \propto \int_\mathbb{R} \exp\left\{- \frac{1}{2} \tau (y - \theta)^2 \right\} \exp\left\{- \frac{1}{2} \tau_0 (\theta - \theta_0)^2 \right\} \, d\theta
\end{split}$$`

--

Quadratic: `\(\tau_0(\theta - \theta_0)^2 = \tau_0 \theta^2 - 2 \tau_0 \theta_0 \theta + \tau_0 \theta_0^2\)`

--

Steps:

1) **Expand** quadratics

--

2) **Group** terms with `\(\theta^2\)` and `\(\theta\)`

--

3) Read off the **posterior precision** and **posterior mean**

--

4) **Complete the square**

--

5) **Integrate** out `\(\theta\)` to obtain the marginal!

---

<div class="question">
Try it!
</div>

`$$p(y) \propto \int_\mathbb{R} \exp\left\{- \frac{1}{2} \left[\tau (y - \theta)^2 + \tau_0 (\theta - \theta_0)^2 \right] \right\} \, d\theta$$`
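--

As a hint for steps 1) and 2), expanding and grouping the two quadratics in `\(\theta\)` gives

`$$\tau (y - \theta)^2 + \tau_0 (\theta - \theta_0)^2 = (\tau + \tau_0)\,\theta^2 - 2 (\tau y + \tau_0 \theta_0)\,\theta + \tau y^2 + \tau_0 \theta_0^2$$`

so step 3) reads off precision `\(\tau + \tau_0\)` and mean `\((\tau y + \tau_0 \theta_0)/(\tau + \tau_0)\)`; completing the square leaves a Gaussian integral in `\(\theta\)` times a term free of `\(\theta\)`.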
---
## Results

Posterior for `\(\theta\)` based on a single observation (Conjugate family)
`$$\theta \mid y \sim \textsf{N} \left(\hat{\theta}, \frac{1}{\tau_0 + \tau} \right)$$`

--

- posterior mean `\(\hat{\theta} = \frac{\tau_0}{\tau_0 + \tau} \theta_0 + \frac{\tau}{\tau_0 + \tau} y\)`

--

- precision-weighted average of the prior mean and the MLE (based on 1 observation)

--

- posterior precision is the sum of the prior precision and the data precision

--

- marginal distribution for `\(Y\)` (prior predictive)
`$$Y \sim \textsf{N}\left(\theta_0, \frac{1}{\tau_0} + \frac{1}{\tau}\right) \text{ or } \textsf{N}(\theta_0, \sigma^2 + \sigma^2_0)$$`

--

- two sources of variability: variability from the model for the data and prior variability

---
## Prior Predictive

- useful to think about observable quantities when choosing the prior

--

- sample directly from the prior predictive and assess whether the samples are consistent with our prior knowledge

--

- if not, go back and modify the prior & repeat

--

- sequential substitution sampling (repeat `\(T\)` times; see the R sketch on the next slide)

  1) draw `\(\theta^{(t)} \sim \pi(\theta)\)`

  2) draw `\(y^{(t)} \sim p(y \mid \theta^{(t)})\)`

--

- takes into account uncertainty about `\(\theta\)` and variability in `\(Y\)`!
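---
## Prior Predictive in R

A minimal sketch of the substitution sampler, assuming illustrative hyperparameter values (not from the lecture): `\(\theta_0 = 0\)`, `\(\tau_0 = 1/4\)`, `\(\tau = 1\)`.

```r
theta0 <- 0; tau0 <- 1/4      # assumed prior mean and precision
tau <- 1                      # assumed (known) data precision
n_draws <- 10000              # T in the notation above

theta <- rnorm(n_draws, mean = theta0, sd = 1/sqrt(tau0))  # 1) theta ~ pi(theta)
y     <- rnorm(n_draws, mean = theta,  sd = 1/sqrt(tau))   # 2) y | theta
var(y)   # should be close to 1/tau0 + 1/tau = 5
```

The sample variance is close to `\(\sigma^2_0 + \sigma^2\)`, the two sources of variability noted on the Results slide.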
---
## Posterior Updating

- Sequential updating: use the previous result as our prior!

--

- New prior after seeing 1 observation is
`$$\textsf{N}(\theta_1, 1/\tau_1)$$`

--

- prior mean is the weighted average
`$$\theta_1 \equiv \frac{\tau_0 \theta_0 + \tau y_1}{\tau_0 + \tau}$$`

--

- prior precision after 1 observation
`$$\tau_1 \equiv \tau_0 + \tau$$`

--

- prior variance is now `\(\sigma^2_1 = 1/\tau_1\)`

---
## Posterior Predictive for `\(y_2\)` given `\(y_1\)`

- Conditional `\(p(y_2 \mid y_1) = p(y_2, y_1)/p(y_1)\)` (Hard way!)

--

- Use the latent variable representation
`$$p(y_2 \mid y_1) = \int_\Theta \frac{p(y_2 \mid \theta)\, p(y_1 \mid \theta)\, \pi(\theta) \, d\theta}{p(y_1)}$$`

--

- simplify to the previous problem and use those results
`$$p(y_2 \mid y_1) = \int_\Theta p(y_2 \mid \theta)\, \pi(\theta \mid y_1) \, d\theta$$`

--

- (Posterior) Predictive
`$$y_2 \mid y_1 \sim \textsf{N}(\theta_1, \sigma^2 + \sigma^2_1)$$`

---
## Iterated Expectations

Based on the expressions above, we have the exponential of a quadratic in `\(y_2\)`, so we know the distribution is Gaussian

--

- Find the mean and variance using iterated expectations:

--

- mean
`$$\textsf{E}[Y_2 \mid y_1] = \textsf{E}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]$$`

---
## Variance via Iterated Expectations

`\(\textsf{Var}[Y_2 \mid y_1] =\)`
`$$\textsf{E}_{\theta \mid y_1}[\textsf{Var}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1] + \textsf{Var}_{\theta \mid y_1}[\textsf{E}_{Y_2 \mid y_1, \theta} [Y_2 \mid y_1, \theta] \mid y_1]$$`

---
## Updated Posterior for `\(\theta\)`

`$$p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta)\, p(y_1 \mid \theta)\, \pi(\theta)$$`

--

`$$p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta)\, p(\theta \mid y_1)$$`

--

Apply the previous updating rules

--

- new posterior mean
`$$\theta_2 = \frac{\tau_1 \theta_1 + \tau y_2}{\tau_1 + \tau} = \frac{\tau_0 \theta_0 + 2 \tau \bar{y}}{\tau_0 + 2 \tau}$$`

--

- new precision
`$$\tau_2 = \tau_1 + \tau = \tau_0 + 2 \tau$$`

---
## After `\(n\)` observations

Posterior for `\(\theta\)`
`$$\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}}{\tau_0 + n \tau}, \frac{1}{\tau_0 + n \tau} \right)$$`

--

Posterior Predictive Distribution for `\(Y_{n+1}\)`
`$$Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}}{\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{\tau_0 + n \tau} \right)$$`

--

- Shrinkage of the MLE toward the prior mean

--

- More accurate estimation of `\(\theta\)` as `\(n \to \infty\)` (reducible error)

--

- Cannot reduce the error for predicting `\(Y_{n+1}\)` due to `\(\sigma^2\)`

--

- predictive distribution for the next observation given _everything_ we know: prior and likelihood

---
## Results with Jeffreys' Prior

- What if `\(\tau_0 \to 0\)`? (or `\(\sigma^2_0 \to \infty\)`)

--

- Prior predictive `\(\textsf{N}(\theta_0, \sigma^2_0 + \sigma^2)\)` (not proper in the limit)

--

- Posterior for `\(\theta\)` (formal posterior)
`$$\theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}}{\tau_0 + n \tau}, \frac{1}{\tau_0 + n \tau} \right)$$`

--

`$$\to \qquad \theta \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \frac{1}{n \tau} \right)$$`

--

- Recovers the MLE as the posterior mode!

--

- Posterior variance of `\(\theta\)` is `\(\sigma^2/n\)` (same as the variance of the MLE)

---
## Posterior Predictive Distribution

Posterior predictive distribution for `\(Y_{n+1}\)`
`$$Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \frac{\tau_0 \theta_0 + n \tau \bar{y}}{\tau_0 + n \tau}, \frac{1}{\tau} + \frac{1}{\tau_0 + n \tau} \right)$$`

--

Under Jeffreys' prior
`$$Y_{n+1} \mid y_1, \ldots, y_n \sim \textsf{N}\left( \bar{y}, \sigma^2 \left(1 + \frac{1}{n}\right)\right)$$`

--

Captures the extra uncertainty due to not knowing `\(\theta\)` (compared to the plug-in approach, where we substitute the MLE into the sampling model!)
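---
## Posterior Predictive in R

A minimal Monte Carlo check of the Jeffreys-prior result, assuming illustrative values (not from the lecture): `\(n = 10\)`, `\(\sigma = 1\)`, and a true mean of 2.

```r
set.seed(42)
n <- 10; sigma <- 1                  # assumed sample size and sd
y <- rnorm(n, mean = 2, sd = sigma)  # simulated data
ybar <- mean(y)

# composition sampling: draw theta | y first, then Y_{n+1} | theta
theta <- rnorm(10000, mean = ybar, sd = sigma / sqrt(n))
ynew  <- rnorm(10000, mean = theta, sd = sigma)

sd(ynew)                  # close to sigma * sqrt(1 + 1/n)
sigma * sqrt(1 + 1/n)     # exact predictive sd; plug-in would give just sigma
```

The predictive standard deviation exceeds the plug-in value `\(\sigma\)`, reflecting the extra uncertainty from estimating `\(\theta\)`.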
---
## Comparing Estimators

Expected loss (from the frequentist perspective) of using the Bayes estimator

--

- The posterior mean is optimal under squared error loss (minimizes Bayes risk) [and, since this posterior is symmetric, under absolute error loss as well]

--

Compute the Mean Squared Error (or Expected Average Loss)
`$$\textsf{E}_{\bar{y} \mid \theta}\left[\left(\hat{\theta} - \theta \right)^2 \mid \theta \right] = \textsf{Bias}(\hat{\theta})^2 + \textsf{Var}(\hat{\theta})$$`

--

- For the MLE `\(\bar{Y}\)` this is just the variance of `\(\bar{Y}\)`, i.e. `\(\sigma^2/n\)`

---
## MSE for Bayes

`$$\textsf{E}_{\bar{y} \mid \theta}\left[\left(\hat{\theta} - \theta \right)^2 \mid \theta \right] = \textsf{MSE} = \textsf{Bias}(\hat{\theta})^2 + \textsf{Var}(\hat{\theta})$$`

- Bias of the Bayes estimator
`$$\textsf{E}_{\bar{Y} \mid \theta}\left[ \frac{\tau_0 \theta_0 + \tau n \bar{Y}}{\tau_0 + \tau n}\right] - \theta = \frac{\tau_0(\theta_0 - \theta)}{\tau_0 + \tau n}$$`

--

- Variance
`$$\textsf{Var}\left(\frac{\tau_0 \theta_0 + \tau n \bar{Y}}{\tau_0 + \tau n} \mid \theta \right) = \frac{\tau n}{(\tau_0 + \tau n)^2}$$`

--

(Frequentist) expected loss when the truth is `\(\theta\)`
`$$\textsf{MSE} = \frac{\tau_0^2(\theta - \theta_0)^2 + \tau n}{(\tau_0 + \tau n)^2}$$`

--

Behavior?

---
## Plot

<img src="03-normal-predictive-distributions_files/figure-html/MSE-1.png" width="75%" style="display: block; margin: auto;" />

(An R sketch of this comparison appears on the final slide.)

---
## Updating with `\(n\)` Observations

- We can use `\(\cal{L}(\theta)\)` based on `\(n\)` observations and repeat completing the square with the original prior `\(\theta \sim \textsf{N}(\theta_0, 1/\tau_0)\)`

---
## Likelihood Function

- The likelihood for `\(\theta\)` is proportional to the sampling model
`$$p(y \mid \theta, \tau) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, \tau^{\frac{1}{2}} \exp\left\{-\frac{1}{2} \tau (y_i - \theta)^2\right\}$$`

--

<div class="question">
Rewrite in terms of sufficient statistics!
</div>

---
## Simplification

`$$\begin{split}
\cal{L}(\theta) & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n (y_i-\theta)^2\right\}\\
 & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \sum_{i=1}^n \left[ (y_i-\bar{y}) - (\theta - \bar{y}) \right]^2 \right\}\\
 & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + \sum_{i=1}^n (\theta - \bar{y})^2 \right] \right\}\\
 & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + n(\theta - \bar{y})^2 \right] \right\}\\
 & \propto \tau^{\frac{n}{2}} \ \exp\left\{-\frac{1}{2} \tau s^2 (n-1) \right\} \ \exp\left\{-\frac{1}{2} \tau n (\theta - \bar{y})^2 \right\}\\
 & \propto \exp\left\{-\frac{1}{2} \tau n (\theta - \bar{y})^2\right\}
\end{split}$$`

(the cross term `\(-2 \sum_i (y_i - \bar{y})(\theta - \bar{y})\)` vanishes since `\(\sum_i (y_i - \bar{y}) = 0\)`)

---
## Exercises for Practice

<div class="question">
Try this
</div>

1) Use `\(\cal{L}(\theta)\)` based on `\(n\)` observations and `\(\pi(\theta)\)` to find `\(\pi(\theta \mid y_1, \ldots, y_n)\)` based on the sufficient statistics

--

2) Use `\(\pi(\theta \mid y_1, \ldots, y_n)\)` to find the posterior predictive distribution for `\(Y_{n+1}\)`
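---
## MSE Comparison in R

A sketch of the comparison behind the MSE plot earlier, with assumed values `\(\tau = 1\)`, `\(\tau_0 = 1/4\)`, `\(\theta_0 = 0\)`, `\(n = 10\)` (the lecture's actual settings may differ):

```r
tau <- 1; tau0 <- 1/4; theta0 <- 0; n <- 10   # assumed values
theta <- seq(-4, 4, length.out = 200)         # grid of "true" theta values

mse_bayes <- (tau0^2 * (theta - theta0)^2 + tau * n) / (tau0 + tau * n)^2
mse_mle   <- 1 / (tau * n)                    # sigma^2 / n, constant in theta

plot(theta, mse_bayes, type = "l",
     xlab = expression(theta), ylab = "MSE")
abline(h = mse_mle, lty = 2)                  # dashed line: MLE
```

The Bayes estimator beats the MLE for `\(\theta\)` near the prior mean `\(\theta_0\)` and loses far from it; as `\(n\)` grows, both MSEs approach `\(\sigma^2/n\)`.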