
HW8

Date: see the due date in Sakai.

Please look at this assignment before class Tuesday in case questions or clarifications are needed (or post on Piazza). Use LaTeX or write by hand (must be legible) and scan to submit via Sakai.

Let $X^s$ denote the centered and scaled design matrix (see slides or text) for the following problems.

$$\begin{aligned}
Y & = 1 \alpha + X^s \beta + \epsilon \\
U^T Y & = U^T 1 \alpha + U^T X^s \beta + U^T \epsilon \\
\left( \begin{array}{c} y_0^* \\ y_1^* \\ \vdots \\ y_p^* \\ y_{p+1}^* \\ \vdots \\ y_{n-1}^* \end{array} \right) & =
\left[ \begin{array}{cccc}
\sqrt{n} & 0 & \cdots & 0 \\
0 & l_1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & l_p \\
0 & \cdots & \cdots & 0 \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{array} \right]
\left( \begin{array}{c} \alpha \\ \gamma_1 \\ \vdots \\ \gamma_p \end{array} \right) + \epsilon^*
\end{aligned}$$

where $X^s$ has been centered and standardized so that each column has mean 0 and length 1 (see `scale` in R), $U = [1_n/\sqrt{n} \; U_p \; U_{n-p-1}]$ is an orthogonal matrix with $X^s = U_p L V^T$ from the singular value decomposition, and $\gamma = V^T \beta$.
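The construction above can be sketched in R on simulated data; the seed, dimensions, and all variable names below are illustrative choices, not part of the assignment.

```r
# Illustrative sketch of the setup above using simulated data.
set.seed(1)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Center each column, then rescale so each column has mean 0 and length 1
Xc <- scale(X, center = TRUE, scale = FALSE)
Xs <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")

# Singular value decomposition: Xs = U_p L V^T
sv <- svd(Xs)
Up <- sv$u; l <- sv$d; V <- sv$v

# U = [1_n/sqrt(n), U_p, U_{n-p-1}] is orthogonal; the rotated responses
# used by the model never require forming U_{n-p-1} explicitly
ystar0 <- sum(y) / sqrt(n)          # y_0^* = 1^T y / sqrt(n) = sqrt(n) * ybar
ystar  <- drop(crossprod(Up, y))    # y_1^*, ..., y_p^*
```

Because the columns of $X^s$ are centered, $U_p$ is automatically orthogonal to $1_n/\sqrt{n}$, which is what makes the block structure of the rotated design matrix work.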

  1. (optional) Verify the equations above, starting from the model for $Y$ and deriving the rotated model for $Y^*$.
  2. Verify that the OLS estimates are $\hat{\alpha} = \bar{y}$ and $\hat{\gamma}_i = y_i^*/l_i$ for $i = 1, \ldots, p$.
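A numerical sanity check for this problem on simulated data (all names below are our own, not from the assignment): the intercept from `lm()` should equal $\bar{y}$, and rotating $\hat{\beta}$ by $V^T$ should reproduce $y_i^*/l_i$.

```r
# Check alpha-hat = ybar and gamma-hat = ystar / l numerically.
set.seed(1)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p); y <- rnorm(n)
Xc <- scale(X, center = TRUE, scale = FALSE)
Xs <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
sv <- svd(Xs); Up <- sv$u; l <- sv$d; V <- sv$v

fit      <- lm(y ~ Xs)
alphahat <- unname(coef(fit)[1])                # = ybar (columns are centered)
gammahat <- drop(crossprod(V, coef(fit)[-1]))   # gamma-hat = V^T beta-hat
ystar    <- drop(crossprod(Up, y))              # rotated responses
```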

  3. Show that $\textsf{SSE} = Y^T U_{n-p-1} U_{n-p-1}^T Y = Y^T (I - \frac{1}{n} 1 1^T - U_p U_p^T) Y$. Do you need to find $U_{n-p-1}^T Y$ to compute SSE? Hint: show that you can express SSE as a function of the MLEs $\hat{\alpha}$ and $\hat{\gamma}_i$ (and $Y^T Y$). Then write the likelihood function as a product of three terms: one involving only $\alpha$ (given $\phi$), one involving only $\gamma$ (given $\phi$), and one involving only $\phi$, expressed using the MLEs and SSE from $Y^*$.
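One way to check the SSE identity numerically on simulated data (variable names are illustrative): the residual sum of squares from `lm()` should match a formula that uses only $Y^T Y$, $\bar{y}$, and $U_p^T Y$, with $U_{n-p-1}$ never formed.

```r
# Numerical check of the SSE identity on simulated data.
set.seed(1)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p); y <- rnorm(n)
Xc <- scale(X, center = TRUE, scale = FALSE)
Xs <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
Up <- svd(Xs)$u

sse_lm  <- sum(residuals(lm(y ~ Xs))^2)
ystar   <- drop(crossprod(Up, y))
sse_rot <- sum(y^2) - n * mean(y)^2 - sum(ystar^2)   # no U_{n-p-1} needed
```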

  4. Under the prior $$p(\alpha, \phi) \propto 1/\phi, \qquad \kappa_j \mid \phi, \alpha \sim \textsf{G}\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right), \qquad \gamma_j \mid \kappa_j, \phi, \alpha \sim \textsf{N}\!\left(0, \tfrac{1}{\phi \kappa_j}\right),$$ derive (a) the conditional distributions of $\alpha$, $\gamma$, and $\phi$ given the $\kappa_j$ and $Y$, and (b) the full conditional of each $\kappa_j$ given $\alpha$, $\gamma$, $\phi$, and $Y$. You should give a name for each distribution and
    expressions for all of its hyperparameters, not just an expression for the density. Hint: write down the likelihood and priors, but ignore any terms that do not involve the parameter of interest (they go into the constant of proportionality). Simplify until you recognize the distribution.

  5. What prior on $\kappa_j$’s leads to the g-prior?

  6. Modify your Gamma prior on $\kappa_i$ to try to capture the desired features of Goldstein & Smith (1974) (see Christensen, Chapter 15): if $$ \gamma_i^2 < \sigma^2\left[ \frac{2}{\kappa_i} + \frac{1}{l_i^2} \right], $$ the fixed-$\kappa_i$ generalized ridge shrinkage estimator (the posterior mean) beats OLS. Thus if $l_i^2$ is small, almost any $\kappa_i$ will improve over OLS, while if $l_i^2$ is large, only very small values of $\kappa_i$ will give an improvement; i.e., when $l_i^2$ is large, $\kappa_i$ should be small. For $p = 25$ with eigenvalues $l_i^2$ based on `seq(0, 10, length=25)`, plot your prior densities with an overlay of the $\textsf{G}(1/2, 1/2)$ prior and comment.
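A minimal R scaffold for the comparison plot, assuming the overlay is the $\textsf{G}(1/2, 1/2)$ prior from problem 4; your modified prior density goes where the placeholder comment indicates, and all names here are illustrative.

```r
# Scaffold for the prior-density comparison plot.
lsq   <- seq(0, 10, length = 25)                 # eigenvalues l_i^2, p = 25
kgrid <- seq(0.01, 5, length = 500)              # grid of kappa values
ref   <- dgamma(kgrid, shape = 0.5, rate = 0.5)  # G(1/2, 1/2) reference density
plot(kgrid, ref, type = "l", lty = 2,
     xlab = expression(kappa), ylab = "density")
# for each i, add: lines(kgrid, <your prior density given lsq[i]>)
```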

  7. Find the updated full conditionals based on your choice above. Do you need to update all of the full conditionals? Explain.

  8. Implement your two generalized ridge models in R, JAGS, or another language (see the earlier JAGS code as a starting point) and apply them to the longley data. How do your results compare to classical ridge? Include histograms of the posterior distributions of the coefficients, plus means and credible intervals, as well as histograms of the $\kappa$'s with the prior density overlaid. How sensitive are the results to the prior assumptions? How do the estimates of $\kappa_i$ compare to the best GCV estimate from `lm.ridge` from class?
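As a starting point for this problem, a minimal JAGS sketch of the generalized ridge model in the canonical parameterization is below; it is a hedged sketch under stated assumptions, not a complete solution. It assumes `ystar0`, `ystar`, `l`, `n`, and `p` are passed in as data (computed as in the rotated model above), uses the $\textsf{G}(1/2, 1/2)$ prior from problem 4 (swap in your modified prior from problem 6 for the second model), and approximates the improper prior $p(\alpha, \phi) \propto 1/\phi$ with vague proper priors, since JAGS requires proper distributions. Recall that JAGS parameterizes `dnorm` by precision.

```
model {
  ystar0 ~ dnorm(sqrt(n) * alpha, phi)      # y_0^* = sqrt(n) alpha + eps*
  for (j in 1:p) {
    ystar[j] ~ dnorm(l[j] * gamma[j], phi)  # rotated likelihood, precision phi
    gamma[j] ~ dnorm(0, phi * kappa[j])     # N(0, 1/(phi kappa_j))
    kappa[j] ~ dgamma(0.5, 0.5)             # G(1/2, 1/2); replace for model 2
  }
  alpha ~ dnorm(0, 1.0E-10)                 # vague: approximates a flat prior
  phi   ~ dgamma(1.0E-3, 1.0E-3)            # vague: approximates p(phi) ∝ 1/phi
}
```

The diagonal structure of the rotated design means each `ystar[j]` depends on a single `gamma[j]`, which is what keeps these updates cheap (see problem 9).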

  9. Explain the computational advantage of using the canonical parameterization in MCMC.

  10. Provide interpretations of the estimates/credible intervals in the context of the problem. Does the idea of a “one-unit” change in a predictor with all other variables held fixed make sense for this data set?