1. Prove Theorem 6 in the notes regarding the best \(q\)-dimensional affine approximation to \(Y\) (Hint: mimic the proof of Theorem 4).

  2. Recall the definition of the sample covariance matrix between two sets of variables: if \(X\in \mathbb R^{n\times p}\) and \(Y \in \mathbb R^{n\times q}\), then \(Cov[X, Y]\) is the \(p\times q\) matrix whose \((j,k)\)th element is the sample covariance of column \(j\) of \(X\) and column \(k\) of \(Y\). Let \(C\) be the \(n\times n\) centering matrix. Show that \[ Cov[ X, Y] = Cov[ CX,Y] = Cov[X,CY]= Cov[CX,CY]. \]

  3. In PCA we might want to know how the original variables \(Y\) are correlated with the new variables, the principal components \(F\).
    Find a simple formula for \(Cov[Y,F]\) in terms of the eigenvalues and eigenvectors of \(S = Y^\top C Y\). Then find a slightly more complicated formula for the correlation matrix between \(Y\) and \(F\).

  4. The dataset UNComtrade.rds contains a matrix \(Y\) of the total trade volume in log-dollars between \(n=148\) countries for the 10-year period 2001-2010. Specifically, \(y_{i,j}\) is the total trade volume exported from country \(i\) to country \(j\) during this period.

    1. Compute the best rank-2 approximation to the matrix \(Y-11^\top \bar y\), where \(\bar y = 1^\top Y 1/n^2\) is the scalar average of all \(n^2\) entries of the matrix. On a single figure, plot the first two left singular vectors and the first two right singular vectors, using the country names as plotting symbols and colors to distinguish the left singular vectors from the right. Each country should therefore be represented by two points on the plot: \((u_{i,1},u_{i,2})\) in one color and \((v_{i,1},v_{i,2})\) in another color. Describe how the plot can be used to identify pairs of countries with a high trade volume.
    2. Repeat part 1 above, but using the SVD of \(CY\), where \(C\) is the \(n\times n\) centering matrix. Describe how the figure has changed, and explain how to interpret the \(U\) and \(V\) matrices.
    3. Repeat part 1 above, but using the SVD of \(CYC\). Again, describe how the figure differs from those in parts 1 and 2, and explain how to interpret the \(U\) and \(V\) matrices.
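
The three matrices in problem 4 differ only in how they are centered, so it can help to check the mechanics on a small example before working with the real data. The following is a minimal sketch in Python with NumPy, under stated assumptions: the matrix `Y` is a random synthetic stand-in (it is not the UNComtrade data, which would have to be loaded separately), and the helper name `rank2_svd` is hypothetical. It builds the centering matrix \(C = I - 11^\top/n\), forms the three centered matrices, and confirms that the squared error of each rank-2 truncated SVD equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Y = rng.normal(size=(n, n))  # synthetic stand-in, NOT the UNComtrade data

ones = np.ones((n, 1))
C = np.eye(n) - ones @ ones.T / n  # n x n centering matrix


def rank2_svd(M):
    """Return the first two left/right singular vectors and all singular values."""
    U, d, Vt = np.linalg.svd(M)
    return U[:, :2], Vt[:2].T, d


ybar = Y.mean()  # scalar average of all n^2 entries
variants = {
    "grand mean removed": Y - ybar,  # Y - 1 1^T ybar
    "column means removed": C @ Y,   # CY
    "row and column means removed": C @ Y @ C,  # CYC
}

errors = {}
for name, M in variants.items():
    U2, V2, d = rank2_svd(M)
    M2 = U2 @ np.diag(d[:2]) @ V2.T  # rank-2 truncation of the SVD
    # Squared Frobenius error of the truncation vs. discarded singular values
    errors[name] = (np.linalg.norm(M - M2) ** 2, np.sum(d[2:] ** 2))
    print(name, errors[name])
```

For the plot, row \(i\) of `U2` gives the point \((u_{i,1},u_{i,2})\) and row \(i\) of `V2` gives \((v_{i,1},v_{i,2})\). The sign of each singular vector pair is arbitrary, so a figure may appear reflected from one run or software package to another.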

Hints for problem 1:

  1. Mimicking the proof of Theorem 4, show that the problem reduces to that of finding the best approximation to \(E\) of the form \(A W^\top\), where \(A\in \mathbb R^{n\times q}\) and \(W\in \mathbb R^{p\times q}\).

  2. Without loss of generality, assume \(W^\top W = I_q\). Now find the value of \(A\) that minimizes the sum of squared error, for a given \(W\).

  3. Using the result of hint 2, reduce the problem to one of finding the matrix \(W\) that maximizes the trace of \(W^\top S W\), where \(S=E^\top E\).

  4. Using the eigendecomposition \(S=V\Lambda V^\top\), express the trace as a sum involving the eigenvalues in \(\Lambda\) and the squared entries of the \(p\times q\) matrix \(X = V^\top W\). Note that \(X^\top X = I_q\), so in particular the squared entries are between 0 and 1, the squared entries in each column sum to 1, and the sum of all squared entries is \(q\).
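
The facts listed in hint 4 can be checked numerically before they are used in the proof. The sketch below is only a sanity check under stated assumptions, not part of the argument: `E` is a random synthetic matrix, `W` is an arbitrary matrix with orthonormal columns obtained from a QR factorization, and the sizes `n`, `p`, `q` are chosen for illustration. It verifies the stated properties of \(X = V^\top W\) and that the trace of \(W^\top S W\) equals an eigenvalue-weighted sum of the squared entries of \(X\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 20, 5, 2            # illustrative sizes (assumed)
E = rng.normal(size=(n, p))   # synthetic stand-in for E
S = E.T @ E

# Eigendecomposition S = V diag(lam) V^T (eigh returns ascending eigenvalues)
lam, V = np.linalg.eigh(S)

# An arbitrary p x q matrix with orthonormal columns, so W^T W = I_q
W, _ = np.linalg.qr(rng.normal(size=(p, q)))

X = V.T @ W
X2 = X ** 2                   # squared entries of X

trace_direct = np.trace(W.T @ S @ W)
# Weight the squared entries in row j of X by the j-th eigenvalue
trace_weighted = float(lam @ X2.sum(axis=1))

print("X^T X = I_q:", np.allclose(X.T @ X, np.eye(q)))
print("column sums of squares:", X2.sum(axis=0))
print("total of squares:", X2.sum())
print("traces:", trace_direct, trace_weighted)
```

Because the check holds for any `W` with orthonormal columns, rerunning it with a different seed is a quick way to build intuition for which choice of weights makes the sum largest.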