Final Exam

STA 211 Spring 2023 (Jiang)

Instructions and Duke Community Standard

You may take as much time as you need to complete the exam within the window from 2:00 PM on May 3, 2023, to 11:59 PM that night. Upload your exam on Gradescope (just like a homework assignment) to turn it in. Work will not be accepted after the submission window closes; there are no exceptions.

There are 100 points on this exam (with an additional 3 free bonus points given to everybody), split into two questions worth 35 and 65 points, respectively.

You may use any notes, books, or existing internet resources to answer the questions. You may not collaborate or communicate with anyone except the instructor regarding the exam (e.g., you may not communicate with other students or the TAs, or post or solicit help on the internet or through any other means of communication). Doing so is a violation of the Duke Community Standard.

Sign your name on your submission as a pledge to uphold the Duke Community Standard on this exam. Note that any evidence of academic dishonesty or misconduct will automatically result in a failing grade for the course.

The Duke Community Standard is reproduced as follows:

  • I will not lie, cheat, or steal in my academic endeavors;
  • I will conduct myself honorably in all my endeavors; and
  • I will act if the Standard is compromised.
Exercise 1 (35 points)

Suppose \(\mathbf{X}_{n \times p}\) has full column rank \(p\) with \(n > p\), \(\mathbf{y}\) is an \(n\) vector, and we are interested in the least squares solution to \(\mathbf{y} = \mathbf{X}\boldsymbol\beta\).

The compact singular value decomposition of \(\mathbf{X}\) is its factorization into the matrix product \(\mathbf{U}\mathbf{D}\mathbf{V}^T\), where \(\mathbf{U}_{n \times p}\) is a semi-orthogonal matrix, \(\mathbf{V}_{p \times p}\) is an orthogonal matrix, and \(\mathbf{D}_{p \times p}\) is a diagonal matrix with non-increasing, non-negative entries along the diagonal and 0 elsewhere (since \(\mathbf{X}\) is full rank, all diagonal entries of \(\mathbf{D}\) will in fact be positive).

Express the OLS estimator \(\widehat{\boldsymbol\beta}\) in terms of \(\mathbf{U}\), \(\mathbf{D}\), and \(\mathbf{V}\) in the simplest possible form. Compare this solution to \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\). Why might someone want to use the compact singular value decomposition instead?

\[\begin{align*} \widehat{\boldsymbol\beta} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ &= \left((\mathbf{U}\mathbf{D}\mathbf{V}^T)^T\mathbf{U}\mathbf{D}\mathbf{V}^T\right)^{-1}(\mathbf{U}\mathbf{D}\mathbf{V}^T)^T\mathbf{y}\\ &= (\mathbf{V}\mathbf{D}^2\mathbf{V}^T)^{-1}\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{V}\mathbf{D}^{-2}\mathbf{V}^T\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{V}\mathbf{D}^{-1}\mathbf{U}^T\mathbf{y}, \end{align*}\]

where we use \(\mathbf{U}^T\mathbf{U} = \mathbf{I}_p\), \(\mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = \mathbf{I}_p\), and the invertibility of \(\mathbf{D}\).

The SVD exists for every matrix, regardless of rank, and it avoids explicitly forming and inverting \(\mathbf{X}^T\mathbf{X}\), which squares the condition number of \(\mathbf{X}\) and can be numerically unstable when \(\mathbf{X}\) is very nearly rank-deficient.
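For concreteness, here is a minimal numerical sketch (assuming NumPy is available; the variable names and simulated data are illustrative, not part of the required answer) showing that the SVD-based solution matches the normal-equations solution on a well-conditioned problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))            # design matrix with full column rank
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Compact SVD: U is n x p, d holds the p singular values, Vt is V^T (p x p)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# OLS via the SVD: beta_hat = V D^{-1} U^T y
beta_svd = Vt.T @ ((U.T @ y) / d)

# OLS via the normal equations, for comparison
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_svd, beta_normal))  # True (up to floating-point error)
```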

Exercise 2 (65 points)

Suppose a random variable \(Y\) is strictly positive and has the following density function (for \(\theta > 0\)):

\[\begin{align*} f_Y(y) &= \theta^2 y\exp(-\theta y), \hspace{0.05 in} y > 0. \end{align*}\]

Consider the following model:

\[\begin{align*} E(\mathbf{y} | \mathbf{X}) &= \exp(\mathbf{X}\boldsymbol\beta). \end{align*}\]

Is this model a GLM? If so, is it in canonical form? Show all work and explain. Regardless of whether this model is a GLM, write out the likelihood function and the Newton-Raphson steps for numerically finding the maximum likelihood estimate of \(\boldsymbol\beta\).

\[\begin{align*} f_Y(y) &= \theta^2 y\exp(-\theta y), \hspace{0.05 in} y > 0\\ &= \exp(-\theta y + 2 \log \theta)yI(y > 0). \end{align*}\]

Yes. The density is a member of the one-parameter exponential family, with

\[\begin{align*} h(y) &= yI(y > 0)\\ T(y) &= y\\ \eta(\theta) &= -\theta\\ \psi(\theta) &= -2\log(\theta) \end{align*}\]

The potential model is

\[\begin{align*} E(\mathbf{y} | \mathbf{X}) &= \exp(\mathbf{X}\boldsymbol\beta), \end{align*}\]

which uses a log link. Because the response density belongs to the exponential family and the mean is related to the linear predictor through a smooth, invertible link, this model is a GLM. It is not in canonical form, however: with \(\mu_i = E(y_i \mid \mathbf{x}_i)\), the canonical link would equate the natural parameter with the linear predictor, \(-\theta_i = -2/\mu_i = \mathbf{x}_i\boldsymbol\beta\), whereas this model uses \(\log \mu_i = \mathbf{x}_i\boldsymbol\beta\).

The expectation of this distribution is obtained by integrating by parts twice, or by recognizing the Gamma integral \(\int_0^\infty y^2 e^{-\theta y}\,dy = 2/\theta^3\) (with \(\theta > 0\)):

\[\begin{align*} E(Y) &= \int_0^\infty \theta^2y^2e^{-\theta y}\,dy\\ &= \theta^2 \cdot \frac{2}{\theta^3}\\ &= \frac{2}{\theta} \end{align*}\]
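As a quick sanity check, the density above is that of a Gamma distribution with shape 2 and rate \(\theta\), so the mean can also be verified by simulation. A minimal sketch, assuming NumPy (the value of \(\theta\) and the sample size are illustrative):

```python
import numpy as np

theta = 1.7
rng = np.random.default_rng(1)

# f_Y(y) = theta^2 * y * exp(-theta * y) is Gamma(shape = 2, rate = theta);
# NumPy parameterizes the Gamma distribution via scale = 1 / rate.
samples = rng.gamma(shape=2.0, scale=1.0 / theta, size=1_000_000)

print(samples.mean(), 2 / theta)  # both close to 2 / theta ≈ 1.176
```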

Through the GLM, we link each observation's conditional expectation to its linear predictor:

\[\begin{align*} \frac{2}{\theta_i} &= \exp(\mathbf{x}_i\boldsymbol\beta)\\ \theta_i &= \frac{2}{\exp(\mathbf{x}_i\boldsymbol\beta)} = 2\exp(-\mathbf{x}_i\boldsymbol\beta) \end{align*}\]

The log-likelihood is thus

\[\begin{align*} \log \mathcal{L}(\boldsymbol\beta | \mathbf{y}, \mathbf{X}) &= \sum_{i = 1}^n \left(2\log(\theta_i) + \log(y_i) - \theta_i y_i\right)\\ &= \sum_{i = 1}^n \left(2\log(2) + \log(y_i) - 2\mathbf{x}_i\boldsymbol\beta - 2y_ie^{-\mathbf{x}_i\boldsymbol\beta}\right) \end{align*}\]

And the first and second derivatives with respect to \(\boldsymbol\beta\) are:

\[\begin{align*} \nabla_{\boldsymbol\beta}\log \mathcal{L}(\boldsymbol\beta | \mathbf{y}, \mathbf{X}) &= \sum_{i = 1}^n 2\mathbf{x}_i(y_ie^{-\mathbf{x}_i\boldsymbol\beta} - 1)\\ \nabla^2_{\boldsymbol\beta}\log \mathcal{L}(\boldsymbol\beta | \mathbf{y}, \mathbf{X}) &= -\sum_{i = 1}^n 2y_ie^{-\mathbf{x}_i\boldsymbol\beta}\mathbf{x}_i\mathbf{x}_i^T. \end{align*}\]

The Newton-Raphson iterative steps are

\[\begin{align*} \boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n 2y_ie^{-\mathbf{x}_i\boldsymbol\beta^{(t)}}\mathbf{x}_i\mathbf{x}_i^T\right)^{-1}\sum_{i = 1}^n 2\mathbf{x}_i(y_ie^{-\mathbf{x}_i\boldsymbol\beta^{(t)}} - 1). \end{align*}\]
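Although the exercise only asks for the steps, the following is a minimal numerical sketch of this iteration, assuming NumPy; the function name, starting value, and simulated data are illustrative rather than part of the required solution:

```python
import numpy as np

def newton_raphson(X, y, n_iter=50, tol=1e-10):
    """Find the MLE of beta for E(y | X) = exp(X beta) with the density above."""
    n, p = X.shape
    beta = np.zeros(p)                        # starting value
    for _ in range(n_iter):
        w = y * np.exp(-X @ beta)             # w_i = y_i * exp(-x_i beta)
        grad = 2.0 * X.T @ (w - 1.0)          # gradient of the log-likelihood
        hess = -2.0 * (X * w[:, None]).T @ X  # Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)
        beta = beta - step                    # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Quick check on simulated data: y_i ~ Gamma(shape = 2, rate = 2 * exp(-x_i beta))
rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 0.25])
theta = 2.0 * np.exp(-X @ beta_true)
y = rng.gamma(shape=2.0, scale=1.0 / theta)
print(newton_raphson(X, y))                   # should be close to beta_true
```

In practice one would add a safeguard such as step-halving in case a Newton step overshoots, but the bare iteration above typically converges quickly from a reasonable starting value.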

Calibration question

So that I may calibrate future exams, how long did each of these questions take you, and what was your experience on this exam? Was it about what you expected?