HW 3 Solutions

STA 211 Spring 2023 (Jiang)

Exercise 1

Recall the \(\mathbf{QR}\) factorization of a full column rank \(n \times p\) matrix \(\mathbf{X}\) into the product of an \(n \times p\) matrix \(\mathbf{Q}\) with orthonormal columns and an invertible upper triangular \(p \times p\) matrix \(\mathbf{R}\). Express the least squares solution \(\widehat{\boldsymbol\beta}\) in terms of \(\mathbf{Q}\) and \(\mathbf{R}\). Compare this solution to \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\). Why might someone want to use the \(\mathbf{QR}\) decomposition instead?

\[\begin{align*} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} &= ((\mathbf{QR})^T\mathbf{QR})^{-1}(\mathbf{QR})^T\mathbf{y}\\ &= (\mathbf{R}^T(\mathbf{Q}^T\mathbf{Q})\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y}\\ &= (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y}\\ &= \mathbf{R}^{-1}((\mathbf{R}^T)^{-1}\mathbf{R}^T)\mathbf{Q}^T\mathbf{y}\\ &= \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \end{align*}\]

The main reason is that matrix inversion is a horrible operation computationally! The solution \(\widehat{\boldsymbol{\beta}} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\) is often easier to deal with computationally: inverting an upper triangular matrix (or, better still, solving \(\mathbf{R}\widehat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}\) directly by back substitution) is an “easier” task, in that it takes many fewer calculations, than inverting the matrix \(\mathbf{X}^T\mathbf{X}\). As well, the matrix \(\mathbf{X}^T\mathbf{X}\) is less stable numerically (its condition number is the square of that of \(\mathbf{X}\)), so the \(\mathbf{QR}\) decomposition is preferred as it is less sensitive to numerical perturbations. This often comes into play with “very nearly rank-deficient” design matrices, as happens when predictors are highly correlated with each other.
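As a quick numerical illustration, here is a minimal NumPy sketch (the simulated design with two highly correlated predictors is purely hypothetical): both routes give the same coefficients here, but the condition number of \(\mathbf{X}^T\mathbf{X}\) is the square of that of \(\mathbf{X}\), so the normal-equations route loses precision much faster as the predictors become more collinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated design: an intercept plus two highly correlated predictors.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)            # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

# Normal equations with an explicit inverse (for illustration only).
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# QR route: solve R beta = Q^T y; since R is upper triangular this is just
# back substitution in practice (e.g. scipy.linalg.solve_triangular).
Q, R = np.linalg.qr(X)                         # reduced QR: Q is n x p, R is p x p
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_normal, beta_qr))           # True here, but the QR route is more stable
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))  # the latter is the former, squared
```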

As a final reason (that I don’t expect anyone to know, but it may come in handy one day, such as in STA 360!), there are benefits to using the \(\mathbf{QR}\) decomposition in Bayesian inference, again in the presence of highly correlated predictors. Highly correlated predictors often present computational issues: Gibbs samplers will have difficulty exploring the full space (resulting in a huge loss in computational efficiency), and gradient-based methods (like HMC) will find it difficult to deal with the “weird geometry” of the posterior (it will be very narrow along certain dimensions). The \(\mathbf{QR}\) decomposition helps address these computational issues since the columns of \(\mathbf{Q}\) are orthonormal. Again, not important now, but maybe one day you’ll face this issue and then vaguely remember you heard about this sort of thing somewhere (and it’ll make more sense the second time around).
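To make the “columns of \(\mathbf{Q}\) are orthonormal” point concrete, here is a minimal sketch of the reparameterization idea, with plain least squares standing in for the Bayesian model and a hypothetical simulated design: work with \(\boldsymbol\theta = \mathbf{R}\boldsymbol\beta\) so the “design matrix” becomes \(\mathbf{Q}\), whose columns are orthonormal (the near-collinearity is gone), then map back to \(\boldsymbol\beta\) at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated design with two highly correlated predictors (no intercept).
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-2 * rng.normal(size=n)       # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

Q, R = np.linalg.qr(X)

# The columns of Q are orthonormal, so a sampler (or optimizer) working with
# theta = R @ beta sees a well-conditioned problem...
print(np.round(Q.T @ Q, 10))              # identity matrix

# ...and the original coefficients are recovered afterwards via beta = R^{-1} theta.
theta_hat = Q.T @ y                       # least squares fit on Q (since Q^T Q = I)
beta_hat = np.linalg.solve(R, theta_hat)
print(beta_hat)                           # same as the least squares fit on X itself
```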

Exercise 2

Consider the estimation problem encountered on Slide 9. Use the \(\mathbf{QR}\) decomposition to solve for the least squares solution.

[N.B.: the least squares solution to the following system.]

\[\begin{align*} \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \end{align*}\]

Using the Gram-Schmidt process on \(\mathbf{X}\):

\[\begin{align*} \mathbf{u}_1 &= \begin{bmatrix}1 \\ 1 \\ 1\end{bmatrix}\\ \mathbf{e}_1 &= \begin{bmatrix}\frac{\sqrt{3}}{3} \\ \frac{\sqrt{3}}{3} \\ \frac{\sqrt{3}}{3}\end{bmatrix}\\ \mathbf{u}_2 &= \begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix} - \frac{\begin{bmatrix}1 & 2 & 3\end{bmatrix} \begin{bmatrix}1 \\ 1 \\ 1\end{bmatrix}}{(\sqrt{3})^2} \begin{bmatrix}1 \\ 1 \\ 1\end{bmatrix}\\ &= \begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix} - \begin{bmatrix}2 \\ 2 \\ 2\end{bmatrix}\\ &= \begin{bmatrix}-1 \\ 0 \\ 1\end{bmatrix}\\ \mathbf{e}_2 &= \begin{bmatrix}-\frac{\sqrt{2}}{2} \\ 0 \\ \frac{\sqrt{2}}{2}\end{bmatrix} \end{align*}\]

Thus, the relevant matrices in the \(\mathbf{QR}\) decomposition are

\[\begin{align*} \mathbf{Q} &= \begin{bmatrix} \frac{\sqrt{3}}{3} & -\frac{\sqrt{2}}{2}\\ \frac{\sqrt{3}}{3} & 0\\ \frac{\sqrt{3}}{3} & \frac{\sqrt{2}}{2} \end{bmatrix}\\ \mathbf{R} &= \begin{bmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3}\\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}\\ &= \begin{bmatrix} \sqrt{3} & 2\sqrt{3}\\ 0 & \sqrt{2} \end{bmatrix} \end{align*}\]
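The Gram-Schmidt arithmetic above is easy to double-check numerically; here is a small NumPy sketch that reproduces \(\mathbf{e}_1\) and \(\mathbf{e}_2\) (and hence \(\mathbf{Q}\) and \(\mathbf{R}\)):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Classical Gram-Schmidt on the two columns of X.
u1 = X[:, 0]
e1 = u1 / np.linalg.norm(u1)                    # [sqrt(3)/3, sqrt(3)/3, sqrt(3)/3]

u2 = X[:, 1] - (X[:, 1] @ u1) / (u1 @ u1) * u1  # subtract the projection onto u1
e2 = u2 / np.linalg.norm(u2)                    # [-sqrt(2)/2, 0, sqrt(2)/2]

Q = np.column_stack([e1, e2])
R = Q.T @ X                                     # [[sqrt(3), 2*sqrt(3)], [0, sqrt(2)]]
print(Q)
print(R)
```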

The inverse of \(\mathbf{R}\) is

\[\begin{align*} \mathbf{R}^{-1} = \begin{bmatrix} \frac{\sqrt{3}}{3} & -\sqrt{2}\\ 0 & \frac{\sqrt{2}}{2} \end{bmatrix} \end{align*}\]

The least squares solution is

\[\begin{align*} \widehat{\boldsymbol\beta} &= \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\\ &= \begin{bmatrix} \frac{\sqrt{3}}{3} & -\sqrt{2}\\ 0 & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix}\\ &= \begin{bmatrix} \frac{\sqrt{3}}{3} & -\sqrt{2}\\ 0 & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} \frac{11\sqrt{3}}{3} \\ \frac{3\sqrt{2}}{2} \end{bmatrix}\\ &= \begin{bmatrix} \frac{2}{3} \\ \frac{3}{2} \end{bmatrix} \end{align*}\]
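The same answer comes out of a built-in QR routine (note that np.linalg.qr may flip the sign of a column of \(\mathbf{Q}\) and the corresponding row of \(\mathbf{R}\), but the least squares solution is unaffected):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])

Q, R = np.linalg.qr(X)                  # reduced QR: Q is 3 x 2, R is 2 x 2
beta_hat = np.linalg.solve(R, Q.T @ y)

print(beta_hat)                         # approximately [0.6667, 1.5], i.e. [2/3, 3/2]
```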

Exercise 3

Explain why the residuals \(\mathbf{y} - \widehat{\mathbf{y}}\) live in the orthogonal complement of the space spanned by \(\mathbf{X}\). What is the dimension of this space?

The OLS solution \(\widehat{\boldsymbol\beta}\) was found by minimizing the MSE, \(\frac{1}{n}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\). The gradient of this function with respect to \(\boldsymbol\beta\) is proportional to \(-\mathbf{X}^T\mathbf{y} + \mathbf{X}^T\mathbf{X}\boldsymbol\beta\), and thus any minimizing solution \(\widehat{\boldsymbol\beta}\) must satisfy the normal equations \(\mathbf{X}^T\mathbf{X}\widehat{\boldsymbol\beta} = \mathbf{X}^T\mathbf{y}\) (i.e., must be a critical point). Rearranging:

\[\begin{align*} \mathbf{X}^T\mathbf{X}\widehat{\boldsymbol\beta} &= \mathbf{X}^T\mathbf{y}\\ \mathbf{0} &= \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\widehat{\boldsymbol\beta}\\ \mathbf{0} &= \mathbf{X}^T(\mathbf{y} - \mathbf{X}\widehat{\boldsymbol\beta})\\ \mathbf{0} &= \mathbf{X}^T(\mathbf{y} - \widehat{\mathbf{y}}). \end{align*}\]

The orthogonal complement of \(\mathcal{C}(\mathbf{X})\) is the set of vectors orthogonal to every vector in \(\mathcal{C}(\mathbf{X})\). The last line above says exactly that the residual vector \(\mathbf{y} - \widehat{\mathbf{y}}\) is orthogonal to every column of \(\mathbf{X}\), and since the columns of \(\mathbf{X}\) span \(\mathcal{C}(\mathbf{X})\) (indeed they form a basis, as \(\mathbf{X}\) has full column rank), the residuals lie in the null space of \(\mathbf{X}^T\), which is precisely this orthogonal complement. Applying rank-nullity to \(\mathbf{X}^T\) (whose rank equals the rank of \(\mathbf{X}\), namely \(p\)), the dimension of this space is \(n - p\); in the example from Exercise 2, it is \(3 - 2 = 1\).
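Continuing with the Exercise 2 numbers, here is a quick NumPy check that the residuals are orthogonal to the columns of \(\mathbf{X}\) and that the orthogonal complement is one-dimensional here:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])

# Least squares fit and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

print(X.T @ resid)              # ~ [0, 0]: residuals are orthogonal to every column of X
print(X.shape[0] - X.shape[1])  # n - p = 1, the dimension of the orthogonal complement
```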