Prove Theorem 6 in the notes regarding the best \(q\)-dimensional affine approximation to \(Y\) (Hint: mimic the proof of Theorem 4).
Recall the definition of the sample covariance matrix between two sets of variables: If \(X\in \mathbb R^{n\times p}\) and \(Y \in \mathbb R^{n\times q}\) then we define \(Cov[X, Y]\) to be the \(p\times q\) matrix with \(j,k\)th element given by the sample covariance of columns \(j\) and \(k\) of \(X\) and \(Y\) respectively. Let \(C\) be the \(n\times n\) centering matrix. Show that \[ Cov[ X, Y] = Cov[ CX,Y] = Cov[X,CY]= Cov[CX,CY]. \]
In PCA we might want to know how the original variables \(Y\) are correlated with the new variables, the principal components \(F\).
Find a simple formula for \(Cov[Y,F]\) in terms of the eigenvalues and eigenvectors of \(S = Y^\top C Y\). Find a slightly more complicated formula for the correlation matrix between \(Y\) and \(F\).
The dataset UNComtrade.rds contains a matrix \(Y\) of the total trade volume in log-dollars between \(n=148\) countries for the 10-year period 2001-2010. Specifically, \(y_{i,j}\) is the total trade volume exported from country \(i\) to country \(j\) during this period.
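The centering identity for \(Cov[X,Y]\) stated above can be sanity-checked numerically before attempting the proof. The sketch below is only an illustration, not a proof: the helper names (`centering_matrix`, `sample_cov`, `max_abs_diff`), the random test data, and the \(1/(n-1)\) normalisation are all assumptions made for this example, and the centering matrix is taken to be \(C = I_n - \tfrac{1}{n}\mathbf 1 \mathbf 1^\top\).

```python
import random

def transpose(M):
    # Swap rows and columns of a list-of-lists matrix.
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    # Plain matrix product for list-of-lists matrices.
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def centering_matrix(n):
    # Assumed definition: C = I_n - (1/n) * 1 1^T.
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
            for i in range(n)]

def sample_cov(X, Y):
    # p x q matrix whose (j, k) entry is the sample covariance of
    # column j of X and column k of Y (assumed 1/(n-1) normalisation).
    n = len(X)
    out = []
    for xc in transpose(X):
        xbar = sum(xc) / n
        row = []
        for yc in transpose(Y):
            ybar = sum(yc) / n
            row.append(sum((a - xbar) * (b - ybar)
                           for a, b in zip(xc, yc)) / (n - 1))
        out.append(row)
    return out

def max_abs_diff(A, B):
    # Largest entrywise absolute difference between two matrices.
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical small example with random data.
random.seed(0)
n, p, q = 7, 3, 2
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [[random.gauss(0, 1) for _ in range(q)] for _ in range(n)]

C = centering_matrix(n)
CX, CY = matmul(C, X), matmul(C, Y)

# Compare Cov[X, Y] with Cov[CX, Y], Cov[X, CY] and Cov[CX, CY].
base = sample_cov(X, Y)
diffs = [max_abs_diff(base, sample_cov(A, B))
         for A, B in [(CX, Y), (X, CY), (CX, CY)]]
print(diffs)  # each entry should be zero up to floating-point error
```

The check uses only the standard library so that each step of the definition stays visible. The proof itself still has to explain why centering a matrix that is already centered, or only one of the two matrices, changes nothing.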
Hints for problem 1:
Mimicking the proof of Theorem 4, show that the problem reduces to that of finding the best approximation to \(E\) of the form \(A W^\top\), where \(A\in \mathbb R^{n\times q}\) and \(W\in \mathbb R^{p\times q}\).
Without loss of generality, assume \(W^\top W = I_q\). Now find the value of \(A\) that minimizes the sum of squared error, for a given \(W\).
Using the result of the previous step, reduce the problem to one of finding the matrix \(W\) that maximizes the trace of \(W^\top S W\), where \(S=E^\top E\).
Using the eigendecomposition \(S=V\Lambda V^\top\), express the trace as a sum over the diagonal entries of \(\Lambda\) and the squared entries of the \(p\times q\) matrix \(X = V^\top W\). Note that \(X^\top X = I_q\), so in particular each squared entry is between 0 and 1, the squared entries in each column sum to 1, and the sum of all squared entries is \(q\).
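As a sketch of where the last hint leads, write \(\lambda_j\) for the \(j\)th diagonal entry of \(\Lambda\) and \(x_{jk}\) for the entries of \(X\) (this indexing is an assumption of the sketch, not notation taken from the notes). Substituting the eigendecomposition into the trace gives \[ \operatorname{tr}(W^\top S W) = \operatorname{tr}(W^\top V \Lambda V^\top W) = \operatorname{tr}(X^\top \Lambda X) = \sum_{j=1}^{p} \lambda_j \sum_{k=1}^{q} x_{jk}^2 . \] Setting \(c_j = \sum_{k=1}^{q} x_{jk}^2\), the trace is the weighted sum \(\sum_j \lambda_j c_j\) with \(c_j \ge 0\) and \(\sum_j c_j = q\). One further fact that the argument appears to need is \(c_j \le 1\), which holds because the orthonormal columns of \(X\) can be extended to a \(p\times p\) orthogonal matrix, whose rows have unit length.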