Due Friday 09/22/2017 3:15:00 PM
Please try to complete before class in case there are questions or clarifications needed (or post on Piazza). Use LaTeX or write by hand (must be legible) and scan to submit via Sakai.
Consider the linear model $Y = X\beta + \epsilon$ with $E[\epsilon] = 0_n$ and Cov$(\epsilon) = \sigma^2 I_n$.
Show that $P_{X^T} = (X^TX) (X^TX)^{-}$ is a projection onto the column space of $X^T$ where $(X^TX)^{-}$ is a generalized inverse of $X^TX$. Does this depend on the actual choice of generalized inverse? (explain) Is this an orthogonal projection?
Show that if $\lambda = X^T a$ with $a \in C(X)$ defines an estimable function, then $(I - P_{X^T}) \lambda = 0$.
Using the spectral decomposition of $(X^TX)$ and the Moore-Penrose generalized inverse (see class notes) find a simple expression for $I - P_{X^T}$ in terms of a reduced set of the eigenvectors of $X^TX$.
If $X$ is full column rank $p$, does a Best Linear Unbiased Prediction (BLUP) exist for all $x_* \in \mathbb{R}^{p}$ ($x_* \neq 0$)? Prove or Disprove.
(optional) Write a function in R to find the projection $(I - P_{X^T}) \lambda$ with the design matrix $X$ (with intercept) and $\lambda$ (vector or matrix) as input (post the R code on Piazza).
Apply your function to the example from class and compare to the conclusions from `epredict`. What sort of tolerance do you need to decide if $(I - P_{X^T}) \lambda = 0$? Extra challenge: have your function return the estimates, SEs, and confidence intervals! (Collaborative function creation encouraged.)
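The tolerance question comes up because floating-point arithmetic rarely returns an exact zero. The sketch below is only an illustration on made-up numbers, not the projection itself: the vector `v` and the threshold `sqrt(.Machine$double.eps)` are assumptions chosen for the demonstration, and a different threshold may suit your function better.

```r
# Two quantities that are zero in exact arithmetic but not in floating point.
v <- c(0.1 + 0.2 - 0.3, 1 - 0.9 - 0.1)

# Exact comparison with zero.
exact_zero <- all(v == 0)

# Comparison against a small tolerance (an assumed, commonly used default).
tol <- sqrt(.Machine$double.eps)
approx_zero <- all(abs(v) < tol)

print(c(exact = exact_zero, approx = approx_zero))
```

The exact test fails while the tolerance-based test passes, which is the kind of behaviour to expect when checking whether a computed projection vanishes.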
For the Prostate data: create dummy or indicator variables for each of the levels of the `gleason` scores and add to the dataframe; e.g. `Prostate$D7 = (gleason == 7)` (use base R or explore `mutate` from `dplyr`). Show that they are linearly related to the intercept.
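As a minimal sketch of the base R route, the indicators can be built in a loop. The data frame `toy` and its values are made up for illustration and stand in for the real data; only the column naming (`D6`, `D7`, ...) follows the example above.

```r
# Hypothetical stand-in for the real data frame (values are invented).
toy <- data.frame(gleason = c(6, 7, 7, 8, 9, 6))

# Add one logical indicator column per observed gleason level.
for (g in sort(unique(toy$gleason))) {
  toy[[paste0("D", g)]] <- (toy$gleason == g)
}

print(toy)
```

Logical columns are treated as 0/1 indicators when used as predictors in `lm`; wrap them in `as.numeric()` if you prefer explicit 0/1 columns.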
Fit a linear model with response `lcavol` including all of the dummy variables and the intercept. What are the coefficients? If you change the order in which the dummy variables enter the model formula, what happens to the coefficients? If you force the intercept to be zero (add `-1` to the formula), what are the results?
Using `as.factor(gleason)` as a predictor in `lm`, what is the equivalent model formula using dummy variables? See `model.matrix` to extract the design matrix.
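One way to inspect what `lm` does with a factor is to pull out the design matrix from a fitted model. The sketch below uses a small made-up data frame `toy` with an invented response `y`; neither is part of the assignment data.

```r
# Hypothetical data: invented response and gleason scores.
toy <- data.frame(
  y = c(1.2, 0.7, 1.9, 2.4, 2.8, 1.1),
  gleason = c(6, 7, 7, 8, 9, 6)
)

# Fit with the factor as predictor and extract the design matrix.
fit <- lm(y ~ as.factor(gleason), data = toy)
X <- model.matrix(fit)

# Column names show which indicator columns R constructed.
print(colnames(X))
print(X)
```

Comparing the columns of `X` with your own dummy variables is a direct way to work out the equivalent formula.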
What is the interpretation of these coefficients? (Provide an explanation in a couple of sentences with the actual estimates. Full credit for interpretation in the original units.)
In the model with all dummy variables and the intercept, use the theorem from class to show that each of the individual coefficients of the dummy variables and intercept are not estimable.
(for the energetic student; otherwise optional) The `epredict` function assumes that the intercept is always included, so any linear combination of $\beta$ always has the intercept added, which means we cannot use the function to see if individual $\beta_j$ are estimable via a $\lambda = (0, 0, 1, \ldots, 0)^T$. Create a new variable that is a column of ones, `Prostate$Int = rep(1, n)`, and fit the model using the formula `lpsa ~ Int + D6 + D7 + D8 + D9 -1`, where `D7` is the dummy variable indicating that the gleason score is 7 and `-1` drops the column of ones added by default. Create a data frame for predicting that will let you demonstrate with `epredict` that none of the individual $\beta_j$ are estimable.
Review Chapters 2 and 6 and the Appendices in Plane Answers to Complex Questions.