HW3

Date: Thu, Sep 7, 2017

Links

Reading

Due Wednesday 09/13/2017 11:55:00 PM

Please try to complete the assignment before class on Tuesday so that any questions or needed clarifications can be raised then (or post on Piazza). Use LaTeX or write by hand (must be legible) and scan to submit via Sakai. For the data analysis part, please use RStudio with RMarkdown or knitr to create a PDF document to upload.

Consider the linear model $Y \sim N(\mu, \sigma^2 I_n)$ with $\mu = 1_n \beta_0 + X \beta$ and $X$ a full rank matrix with rank $p$.

a) Show that the projection, $P$, onto the column space spanned by the vector $1_n$ (the vector of ones of length $n$) and $X$ may be written as $$P = P_1 + P_{X - 1_n \bar{x}^T}$$ where $X - 1_n \bar{x}^T = (I - P_1) X$. Show that the diagonal elements of $P$ are $$h_{ii} = \frac{1}{n} + (x_i-\bar{x})^T\left((X- 1_n \bar{x}^T)^T(X - 1_n \bar{x}^T)\right)^{-1}(x_i - \bar{x})$$ where $x_i^T$ is the $i$th row of $X$ (recall all vectors are column vectors). The $h_{ii}$ are known as the leverage values.

b) Find the sampling distribution of $\hat{\mu}_i$ (the estimated mean of $Y_i$ at $x_i^T$) as a function of $h_{ii}$ and provide an expression for a 95% confidence interval. For what values of $x$ will the interval be the narrowest? Explain.

c) Given $\sigma^2$, find the distribution of $e_i$ as a function of $h_{ii}$. Explain (rigorously) why $e_i$ unconditional on $\sigma^2$ does not have a Student $t$ distribution with $n - p - 1$ degrees of freedom.
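
Before attempting the proof in part a), it can help to check the two identities numerically. The following is a minimal sketch in base R on simulated data; the sizes `n` and `p`, the helper `proj`, and all other object names are assumptions made for illustration and do not come from the lecture code. A numerical check of this kind is not a substitute for the derivation.

```r
# Simulated design: n observations, p predictors (values are arbitrary).
set.seed(1)
n <- 20
p <- 3
X <- matrix(rnorm(n * p), n, p)
ones <- rep(1, n)

# Hypothetical helper: projection onto the column space of a
# full-column-rank matrix A, i.e. A (A^T A)^{-1} A^T.
proj <- function(A) A %*% solve(crossprod(A), t(A))

P  <- proj(cbind(ones, X))           # projection onto span{1_n, X}
P1 <- proj(matrix(ones, ncol = 1))   # projection onto span{1_n}
Xc <- (diag(n) - P1) %*% X           # centered predictors, X - 1_n xbar^T

# Largest entrywise gap between P and P_1 + P_{Xc}
decomp_gap <- max(abs(P - (P1 + proj(Xc))))

# Leverages from the closed form, compared with hatvalues() from lm()
h_formula <- 1 / n + rowSums((Xc %*% solve(crossprod(Xc))) * Xc)
y <- rnorm(n)                        # the response does not affect leverages
h_lm <- unname(hatvalues(lm(y ~ X)))
lev_gap <- max(abs(h_formula - h_lm))

c(decomp_gap = decomp_gap, lev_gap = lev_gap, sum_h = sum(h_formula))
```

Both gaps should be zero up to floating-point error, and the leverages should sum to $p + 1$, the trace (rank) of $P$.
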

Now consider predicting $Y_{*}$ at a new point $x_{*}^T$ where $Y_{*} \sim N(1 \beta_0 + x_*^T\beta, \sigma^2)$.

a) Find the distribution of the predicted residual $e_{*}= Y_* - 1\hat{\beta}_0 - x_*^T \hat{\beta}$ (given $\beta_0, \beta$ and $\sigma^2$). Both $Y$ and $Y_{*}$ are random variables here.

b) Show that the standardized predicted residual (centered so that the mean is 0 and scaled so that the sd is 1), with $\sigma^2$ replaced by the usual unbiased estimate $\hat{\sigma}^2 = Y^T (I - P_X)Y/(n - p - 1)$, has a Student $t$ distribution. What are the degrees of freedom? Explain.

c) Use the standardized predicted residual to construct a 95% confidence interval (also called a prediction interval) for $Y_{*}$.
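
Once an interval has been derived, a small simulation is one way to sanity-check it. The sketch below, in base R, estimates how often the interval returned by `predict(..., interval = "prediction")` covers a fresh draw of $Y_{*}$. The design, the new point, and the "true" parameter values are all hypothetical choices made for illustration; a single predictor is used only to keep the example short.

```r
set.seed(2)
n <- 30
x <- runif(n, 0, 10)             # fixed design, arbitrary values
x_new <- data.frame(x = 12)      # new point, outside the observed range
beta0 <- 1                       # hypothetical true intercept
beta1 <- 0.5                     # hypothetical true slope
sigma <- 2                       # hypothetical true error sd

covered <- replicate(2000, {
  # Both the training responses and the new response are redrawn each time
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  y_new <- beta0 + beta1 * x_new$x + rnorm(1, sd = sigma)
  pi95 <- predict(fit, newdata = x_new, interval = "prediction", level = 0.95)
  pi95[, "lwr"] <= y_new && y_new <= pi95[, "upr"]
})

coverage <- mean(covered)        # proportion of intervals containing y_new
coverage
```

The empirical coverage should be close to the nominal 0.95, up to Monte Carlo error. Note that both $Y$ and $Y_{*}$ are random in each replicate, matching the setup in part a).
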

Refer to the Prostate data from library(lasso2), loaded with data(Prostate) (see the R code from lecture).

a) Fit the regression model with response lcavol and with svi and lpsa as predictors. Construct 95% confidence intervals for each coefficient and provide a meaningful interpretation of each in terms of changes in the median cancer volume (not log cancer volume); include any units in your interpretation. Note that a "1 unit" change may or may not be meaningful for interpretation, so adjust as needed.

b) Plot the cancer volume versus PSA using a log scale on both axes. Add the fitted regression function for svi = 1 and svi = 0, with lines representing the (pointwise) 95% confidence intervals (CI) for each. Use a different color and line type for the fitted function and the confidence intervals. Hint: see the predict function in R to obtain the confidence intervals.

c) Add to the plot 95% prediction intervals from predict using a different line type and color from the CI. Add a legend to your plot.

d) Why are the prediction intervals wider than the confidence intervals?
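
As a generic illustration of the predict hint in part b), here is a minimal sketch of obtaining pointwise confidence and prediction bands from a log-log fit and drawing them on log-scaled axes. It deliberately uses the built-in `mtcars` data rather than the Prostate data, so the model, the variable names, and the colors and line types are assumptions chosen only to show the mechanics.

```r
# Log-log fit with a 0/1 indicator, on a built-in data set (illustration only)
fit <- lm(log(mpg) ~ log(wt) + am, data = mtcars)

# Grid of predictor values for one group (am = 0)
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  am = 0
)

# predict() returns a matrix with columns fit, lwr, upr on the log scale;
# exp() maps the bands back to the original scale
conf_int <- exp(predict(fit, newdata = grid, interval = "confidence", level = 0.95))
pred_int <- exp(predict(fit, newdata = grid, interval = "prediction", level = 0.95))

# Scatterplot with both axes on a log scale, then fitted curve and bands
plot(mpg ~ wt, data = mtcars, log = "xy")
lines(grid$wt, conf_int[, "fit"], col = "black", lty = 1)
matlines(grid$wt, conf_int[, c("lwr", "upr")], col = "blue", lty = 2)
matlines(grid$wt, pred_int[, c("lwr", "upr")], col = "red", lty = 3)
legend("topright",
       legend = c("fitted", "95% CI", "95% PI"),
       col = c("black", "blue", "red"), lty = 1:3)
```

Repeating the grid with `am = 1` would add the second group. Because the response is modeled on the log scale, the back-transformed fitted curve describes the median, not the mean, of the response on the original scale.
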

Review Chapter 2 and Appendix C in Plane Answers to Complex Questions on Vector Spaces

Review Material from the StatSci Computing Bootcamp for R or other links under Resources