Key to HW6 on principal components

After taking the natural logarithms of the floor and windowsill lead levels, the four resulting eigenvalues were 2.6783, 0.8690, 0.2687, and 0.1840. This means that the first principal component accounts for 2.6783 of the total variance. Since the variables are standardized, the total variance in the model equals 4. Hence the first principal component explains 2.6783/4, or about 66.96%, of the total variance; the second component explains 0.8690/4, or about 21.73%; and so on.

Applying the eigenvalue-1 rule, only the first principal component should be kept. The idea behind the eigenvalue-1 rule is that since each standardized variable has a variance of one, a component with an eigenvalue < 1 accounts for less variation than a single variable and is therefore of no use for data reduction.
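The two steps above (variance proportions and the eigenvalue-1 rule) can be sketched in a few lines of Python, using the four eigenvalues reported here:

```python
# Variance explained by each principal component, and the eigenvalue-1 rule.
# With standardized variables, the total variance equals the number of
# variables, which is 4 in this problem.
eigenvalues = [2.6783, 0.8690, 0.2687, 0.1840]
total_variance = len(eigenvalues)  # 4 standardized variables

proportions = [ev / total_variance for ev in eigenvalues]
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1]

# The first PC explains about 67% of the total variance, the second about 22%.
print([round(p * 100, 2) for p in proportions])
print(retained)  # only component 1 has an eigenvalue > 1
```

Note that the proportions sum to 1, since the four eigenvalues sum to the total variance of 4.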

The scores for the first principal component are obtained from JMP. Overall airborne lead contamination was highest in August 1980 and lowest in March 1981.

Umbilical lead level is regressed on the first principal component (PC). The goodness of fit, as measured by R-squared, is 0.37, indicating that the first principal component explains 37% of the variation in umbilical lead levels. The t test shows that the principal component is statistically significant at the 0.05 level, since the p-value 0.0202 < 0.05. The coefficient on the principal component indicates that a one-unit increase in the PC is associated with a 0.029 increase in umbilical lead level.
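The regression quantities discussed here (slope, R-squared, and the t statistic) can be computed by hand with NumPy. The PC scores and lead levels below are made up for illustration, since the actual data live in the JMP output:

```python
import numpy as np

# Illustrative OLS of an outcome on a single principal-component score.
# Both series are hypothetical; n = 14 matches the problem's sample size.
rng = np.random.default_rng(0)
pc1 = rng.normal(size=14)                           # hypothetical PC1 scores
y = 0.029 * pc1 + rng.normal(scale=0.02, size=14)   # hypothetical lead levels

X = np.column_stack([np.ones_like(pc1), pc1])       # intercept + PC1
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R-squared: share of the variation in y explained by the regression
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# t statistic for the slope coefficient
n, k = X.shape
sigma2 = resid @ resid / (n - k)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope

print(beta[1], r2, t_slope)
```

With the real data, JMP reports the same three quantities directly in its Fit Model output.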

Two diagnostic checks on the regression are undertaken. The first is a Durbin-Watson test for autocorrelation. We should worry about autocorrelation here because the data are a time series: if airborne lead levels are high in August, they are likely to be high in July and/or September as well. Recall that if there is autocorrelation in the model (i.e., the errors are correlated with one another), the efficiency of OLS is reduced and the estimated standard errors are biased. The Durbin-Watson statistic obtained here is 1.24; since d < 2, this suggests positive autocorrelation. To test whether it is significant, we compare d with the critical values (Table A4.4) for n observations and k-1 regressors. Here n = 14 and k-1 = 1, so d(lower) = 1.08 and d(upper) = 1.36. Since d(lower) < d < d(upper), the test is inconclusive and we cannot say whether the null hypothesis should be rejected.
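The Durbin-Watson statistic is simple to compute from the residual series: d = Σ(e_t - e_{t-1})² / Σe_t², summed over t = 2..n in the numerator and t = 1..n in the denominator. A minimal sketch, with a hypothetical residual series of length 14:

```python
# Durbin-Watson statistic. Values near 2 suggest no autocorrelation;
# d < 2 suggests positive autocorrelation, d > 2 negative.
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Hypothetical residuals with runs of same-signed values, the typical
# footprint of positive autocorrelation in time-series data.
e = [0.5, 0.6, 0.4, -0.1, -0.3, -0.2, 0.1, 0.4, 0.3, -0.2, -0.4, -0.1, 0.2, 0.3]
d = durbin_watson(e)
print(d)  # well below 2, consistent with positive autocorrelation
```

The statistic alone does not settle significance; as noted above, d must still be compared against the tabulated lower and upper critical values for the given n and number of regressors.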

The second diagnostic check is to examine the plot of residuals versus predicted Y. The error terms appear to be independent, so there are no signs of autocorrelation; overall, the plot looks "all clear".

Question 4:

There are three criteria we could use in determining how many PCs we should retain:

1) subjective judgment; 2) eigenvalue-1 rule; 3) scree graph.

Here the result of using 2) is consistent with that of using 3): the first two PCs should be retained.

The first two PCs' eigenvalues are larger than 1, and in the scree graph the leveling off begins after component 3 (compared with the first two PCs, components 3 through 7 account for relatively little additional variance).

Although this judgment is always subjective, we do not think retaining 3 PCs, as some students suggested, makes sense. Adding component 3 raises the total variance explained from 69% to 80%, but this extra 11% contribution is much smaller than those of PC1 (38%) and PC2 (31%). Moreover, if you retain 3 PCs, you will find later, in the rotation, that the variable LEAVES loads almost equally on PC1 and PC3, which makes interpretation difficult.

Also, straight lines should be used to connect the points in the scree graph. Some students used smooth curves, which is not conventional and makes it harder to compare the steepness of the drops between successive eigenvalues.
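A scree graph drawn the conventional way, with points joined by straight line segments, can be produced as follows. The seven eigenvalues here are illustrative only, chosen to match the pattern described above (PC1 about 38%, PC2 about 31%, and component 3 about 11% of a total variance of 7):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative eigenvalues summing to 7 (seven standardized variables).
eigenvalues = [2.66, 2.17, 0.77, 0.50, 0.40, 0.30, 0.20]
components = range(1, len(eigenvalues) + 1)

# Straight segments between points, not a smoothed curve, so the
# steepness of each drop is easy to compare by eye.
plt.plot(components, eigenvalues, marker="o", linestyle="-")
plt.axhline(1.0, linestyle="--")  # eigenvalue-1 cutoff for reference
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot (illustrative eigenvalues)")
plt.savefig("scree.png")
```

The dashed horizontal line at 1 makes it easy to read off the eigenvalue-1 rule and the scree elbow from the same graph.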

Question 5:

The majority did a good job of confirming the relationships among the outputs.

Question 6:

Some students were confused about how to interpret the results after the PC analysis.

The method is:

We decide to retain the first two PCs, perform the rotation, and obtain the rotated loading matrix (the rotated factor pattern in the JMP output). In each row (i.e., for each variable: DEPTH, WIDTH, etc.), mark the loading with the highest absolute value. You can then see that VELOCITY and SAND (positively) and SILT and LEAVES (negatively) load mainly on factor 1, while DEPTH, WIDTH, and MUD load strongly on factor 2. Although not precise, we could call factor 1 a "speed & bottom" dimension and factor 2 a "volume" dimension.
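The marking step, finding the factor with the largest absolute rotated loading in each row, is easy to automate. The loading matrix below is hypothetical but follows the sign pattern described above:

```python
import numpy as np

# Hypothetical rotated loadings (rows: variables; columns: factors 1 and 2),
# not the actual JMP output. Signs follow the pattern described in the text.
variables = ["DEPTH", "WIDTH", "VELOCITY", "SAND", "SILT", "MUD", "LEAVES"]
loadings = np.array([
    [ 0.10,  0.85],   # DEPTH
    [ 0.15,  0.80],   # WIDTH
    [ 0.82,  0.05],   # VELOCITY
    [ 0.78, -0.10],   # SAND
    [-0.75,  0.20],   # SILT
    [-0.05,  0.70],   # MUD
    [-0.80,  0.12],   # LEAVES
])

# For each variable, the factor with the largest |loading| (1-based index).
dominant = np.abs(loadings).argmax(axis=1) + 1
for var, fac in zip(variables, dominant):
    print(f"{var:8s} -> factor {fac}")
```

Taking the absolute value before the argmax is the key point: SILT and LEAVES load negatively, but the size of the loading, not its sign, decides which factor a variable belongs to.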

Factor 1 ("speed & bottom") is uncorrelated with factor 2 ("volume").

A stream with a high score on factor 1 is one with a high velocity, a lot of sand, and little silt and leaves on the bottom. A stream with a low score on factor 1 flows slowly and has little sand but a lot of silt and leaves on the bottom. As some students pointed out in the HW, this makes sense: a fast-flowing stream tends to flush away silt and leaves easily, while sand is harder to flush away and tends to stay on the bottom.

A stream with a high score on factor 2 is a deep, wide, and muddy one, while a stream with a low score on factor 2 is shallow, narrow, and clear. This also makes sense intuitively.