Due Wednesday 09/13/2017 11:55:00 PM
Please try to complete before class on Tuesday in case there are questions or clarifications needed (or post on Piazza). Use LaTeX or write by hand (must be legible) and scan to submit via Sakai. For the data analysis part, please use RStudio with RMarkdown or knitr to create a pdf document to upload.
Consider the linear model $Y \sim N(\mu, \sigma^2 I_n)$ with $\mu = 1_n \beta_0 + X \beta$ and $X$ a full rank matrix with rank $p$.
a) Show that the projection, $P$, onto the column space spanned by the vector $1_n$ (the vector of ones of length $n$) and the columns of $X$ may be written as $$P = P_1 + P_{X - 1_n \bar{x}^T}$$ where $X - 1_n \bar{x}^T = (I - P_1) X$. Show that the diagonal elements of $P$ are $$h_{ii} = \frac{1}{n} + (x_i-\bar{x})^T\left((X- 1_n \bar{x}^T)^T(X - 1_n \bar{x}^T)\right)^{-1}(x_i - \bar{x})$$ (recall all vectors are column vectors). The $h_{ii}$ are known as the leverage values.
b) Find the sampling distribution of $\hat{\mu}_i$ (the estimated mean of $Y_i$ at $x_i^T$) as a function of $h_{ii}$ and provide an expression for a 95% confidence interval. For what values of $x$ will the interval be the narrowest? Explain.
c) Given $\sigma^2$, find the distribution of the residual $e_i$ as a function of $h_{ii}$. Explain (rigorously) why $e_i$, unconditional on $\sigma^2$, does not have a Student $t$ distribution with $n - p - 1$ degrees of freedom.
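Before attempting the proof in part a), it can help to confirm the leverage identity numerically. The sketch below is only a sanity check on simulated data, not a derivation; the sample size, the number of predictors, and the variable names (`W`, `Xc`, `h_formula`) are illustrative assumptions and are not part of the assignment.

```r
# Numerical sanity check of the leverage formula on simulated data.
# All sizes and names here are illustrative assumptions.
set.seed(1)
n <- 20
p <- 2
X <- matrix(rnorm(n * p), n, p)        # simulated full-rank predictor matrix

# Projection onto the column space of (1_n, X), computed directly
W <- cbind(1, X)
P <- W %*% solve(crossprod(W), t(W))

# Centered predictors: X - 1_n xbar^T = (I - P_1) X
Xc <- sweep(X, 2, colMeans(X))

# h_ii = 1/n + (x_i - xbar)^T (Xc^T Xc)^{-1} (x_i - xbar), one value per row
h_formula <- 1 / n + rowSums((Xc %*% solve(crossprod(Xc))) * Xc)

# Leverages reported by R for a regression on the same design
# (the response values do not affect the leverages)
h_lm <- unname(hatvalues(lm(rnorm(n) ~ X)))

print(all.equal(diag(P), h_formula))
print(all.equal(h_lm, h_formula))
```

Agreement between the three computations is evidence that the algebra is on the right track, but it does not replace the argument asked for in part a).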
Now consider predicting $Y_{*}$ at a new point $x_{*}^T$ where $Y_{*} \sim N(1 \beta_0 + x_*^T\beta, \sigma^2)$.
a) Find the distribution of the predicted residual $e_{*}= Y_* - 1\hat{\beta}_0 - x_*^T \hat{\beta}$ (given $\beta_0, \beta$ and $\sigma^2$). Both $Y$ and $Y_{*}$ are random variables here.
b) Show that the standardized predicted residual (centered so that its mean is 0 and scaled so that its standard deviation is 1, with $\sigma^2$ replaced by the usual unbiased estimate $\hat{\sigma}^2 = Y^T (I - P_X)Y/(n - p - 1)$) has a Student $t$ distribution. What are the degrees of freedom? Explain.
c) Use the standardized predicted residual to construct a 95% confidence interval (also called a prediction interval) for $Y_{*}$.
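One way to check a derived prediction interval is to compare its behavior with what `predict` reports. The sketch below estimates, by simulation, how often the 95% prediction interval from `predict(..., interval = "prediction")` covers a new observation; it is a rough check under an assumed simple linear model, and the true coefficients, sample size, new point `x_star`, and number of replications are all arbitrary choices made for illustration.

```r
# Monte Carlo check of prediction-interval coverage.
# The model, coefficients, and sizes are assumptions chosen for illustration.
set.seed(42)
n <- 30
reps <- 2000
x_star <- 0.5                      # hypothetical new design point
covered <- logical(reps)

for (r in seq_len(reps)) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)        # Y ~ N(beta_0 + x beta, sigma^2), sigma = 1
  fit <- lm(y ~ x)

  # New observation drawn independently from the same model at x_star
  y_star <- 1 + 2 * x_star + rnorm(1)

  pred_int <- predict(fit, newdata = data.frame(x = x_star),
                      interval = "prediction", level = 0.95)
  covered[r] <- pred_int[1, "lwr"] <= y_star && y_star <= pred_int[1, "upr"]
}

coverage <- mean(covered)
print(coverage)                    # should be close to 0.95
```

Because both $Y$ and $Y_{*}$ are regenerated in every replication, the estimated coverage reflects the joint randomness that the interval in part c) is meant to account for.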
Refer to the Prostate data from library(lasso2), loaded with data(Prostate) (see the R code from lecture).
a) Fit the regression model with response lcavol and variables svi and lpsa as predictors. Construct 95% confidence intervals for each coefficient and provide a meaningful interpretation of each in terms of changes in the median cancer volume (not log cancer volume); include any units in your interpretation. Note that a "1 unit" change may or may not be meaningful for interpretation, so adjust as needed.
b) Plot the cancer volume versus PSA using a log scale on both axes. Add the fitted regression function for svi = 1 and svi = 0, with lines representing the (pointwise) 95% confidence intervals (CI) for each. Use a different color and line type for the fitted function and the confidence intervals. Hint: see the predict function in R to obtain the confidence intervals.
c) Add 95% prediction intervals from predict to the plot, using a different line type and color from the CI. Add a legend to your plot.
d) Why are the prediction intervals wider than the confidence intervals?
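For parts b) and c), the general `predict` workflow can be sketched as follows. This is a minimal illustration on simulated stand-in data, not the Prostate data: the data frame `sim`, its generating coefficients, and the grid `grid0` are assumptions made up for the example, and only the svi = 0 curve is drawn. The column names mirror those used in the assignment so the same calls should carry over after replacing `sim` with the real data.

```r
# Sketch of confidence and prediction bands on a log-log plot.
# 'sim' is simulated stand-in data (an assumption), NOT the Prostate data.
set.seed(2017)
n <- 100
sim <- data.frame(svi = rbinom(n, 1, 0.2), lpsa = rnorm(n, 2.5, 1))
sim$lcavol <- -0.5 + 0.6 * sim$svi + 0.55 * sim$lpsa + rnorm(n, sd = 0.8)

fit <- lm(lcavol ~ svi + lpsa, data = sim)

# Grid of lpsa values at svi = 0 (repeat with svi = 1 for the second curve)
grid0 <- data.frame(svi = 0,
                    lpsa = seq(min(sim$lpsa), max(sim$lpsa), length.out = 50))
conf_int <- predict(fit, newdata = grid0, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = grid0, interval = "prediction", level = 0.95)

# Back-transform from the log scale and draw on log-log axes
plot(exp(sim$lpsa), exp(sim$lcavol), log = "xy",
     xlab = "PSA", ylab = "Cancer volume")
lines(exp(grid0$lpsa), exp(conf_int[, "fit"]), col = "black", lty = 1)
matlines(exp(grid0$lpsa), exp(conf_int[, c("lwr", "upr")]),
         col = "blue", lty = 2)
matlines(exp(grid0$lpsa), exp(pred_int[, c("lwr", "upr")]),
         col = "red", lty = 3)
legend("topleft",
       legend = c("Fit (svi = 0)", "95% CI", "95% prediction interval"),
       col = c("black", "blue", "red"), lty = c(1, 2, 3))
```

Both calls to `predict` return a matrix with columns `fit`, `lwr`, and `upr`, so the two sets of bands can be compared column by column when thinking about part d).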
Review Chapter 2 and Appendix C in Plane Answers to Complex Questions on Vector Spaces
Review material from the StatSci Computing Bootcamp for R, or other links under Resources.