Key to HW6 on principal components
After taking the natural logarithms of floor and windowsill lead levels, the four resulting eigenvalues were 2.6783, 0.8690, 0.2687, and 0.1840. This means that the first principal component accounts for 2.6783 of the total variance. Since the variables are standardized, the total variance in the model is equal to 4. Hence the first principal component explains 2.6783/4, or 66.96%, of the total variance; the second component explains 0.8690/4, or 21.73%; etc.
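For concreteness, here is a minimal Python sketch, using only the eigenvalues reported above, of how these percentages are computed:

    eigenvalues = [2.6783, 0.8690, 0.2687, 0.1840]
    total = sum(eigenvalues)  # equals 4 when the variables are standardized
    for i, ev in enumerate(eigenvalues, 1):
        print(f"PC{i} explains {ev / total:.1%} of the total variance")
    # PC1 67.0%, PC2 21.7%, PC3 6.7%, PC4 4.6%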
Applying the eigenvalue-1 rule, only the first principal component should be kept. The idea behind the eigenvalue-1 rule is that since each standardized variable has a variance of one, a component with an eigenvalue < 1 accounts for less variation than a single variable and is therefore useless for data reduction.
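The rule itself amounts to a one-line filter; a sketch using the eigenvalues above:

    eigenvalues = [2.6783, 0.8690, 0.2687, 0.1840]
    keep = [i for i, ev in enumerate(eigenvalues, 1) if ev >= 1]
    print(keep)  # [1]: only the first principal component survives the cutoff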
The scores for the first principal component are obtained from JMP. Overall airborne lead contamination was highest in August 1980 and lowest in March 1981.
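JMP computes these scores internally; for students who want to see the mechanics, here is a Python sketch of the equivalent calculation, with a hypothetical 14 x 4 data matrix standing in for the monthly log-lead measurements:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(14, 4))  # hypothetical stand-in for the 14 monthly observations

    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    first = np.argsort(eigval)[::-1][0]                # index of the largest eigenvalue
    pc1_scores = Z @ eigvec[:, first]                  # PC1 score for each month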
Umbilical lead level is regressed on the first principal component (PC). The goodness of fit, as measured by R-squared, is 0.37, indicating that the first principal component explains 37% of the variation in umbilical lead levels. The t test shows that the principal component is statistically significant at the 0.05 level, since its p-value of 0.0202 is less than 0.05. The coefficient on the principal component indicates that a one-unit increase in the PC is associated with a 0.029 increase in the umbilical lead level.
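For reference, the same regression can be sketched in Python with statsmodels; the arrays below are hypothetical placeholders, and the comments note the values reported in the JMP output:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    pc1 = rng.normal(size=14)                                 # hypothetical PC1 scores
    umbilical = 0.03 * pc1 + rng.normal(scale=0.05, size=14)  # hypothetical outcome

    fit = sm.OLS(umbilical, sm.add_constant(pc1)).fit()
    print(fit.rsquared)    # JMP output above: R-squared = 0.37
    print(fit.pvalues[1])  # JMP output above: p = 0.0202
    print(fit.params[1])   # JMP output above: slope = 0.029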
Two diagnostic checks on the regression are undertaken. The first is a Durbin-Watson test for autocorrelation. The reason we should be worried about autocorrelation here is that the data are a time series: if airborne lead levels are high in August, they are likely to be high in July and/or September as well. Recall that if there is autocorrelation in the model (i.e., the errors are correlated with each other), then the efficiency of OLS is reduced and the estimated standard errors are biased. The Durbin-Watson statistic obtained here is 1.24; since d < 2, this suggests positive autocorrelation. To test whether it is significant, we compare d with the critical values (Table A4.4) for n and k-1 degrees of freedom. Here n = 14 and k-1 = 1, so d(lower) = 1.08 and d(upper) = 1.36. Since d(lower) < d < d(upper), the test is inconclusive and we do not know whether the null hypothesis should be rejected.
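The statistic itself is easy to compute directly; here is a minimal sketch with hypothetical residuals standing in for those of the regression above:

    import numpy as np

    def durbin_watson(e):
        # d = sum over t of (e_t - e_{t-1})^2, divided by the sum of e_t^2
        e = np.asarray(e, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    rng = np.random.default_rng(2)
    resid = rng.normal(size=14)  # hypothetical; the HW regression gives d = 1.24
    d = durbin_watson(resid)
    dL, dU = 1.08, 1.36          # Table A4.4 critical values for n = 14, k-1 = 1
    if d < dL:
        print("reject H0: positive autocorrelation")
    elif d < dU:
        print("inconclusive")    # the case here, since 1.08 < 1.24 < 1.36
    else:
        print("no evidence of positive autocorrelation")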
The second diagnostic check is to examine the plot of residuals versus predicted Y. The error terms appear to be independent, so there are no signs of autocorrelation. Overall, the plot looks “all clear”.
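For students reproducing this check outside JMP, a sketch of the plot with hypothetical fitted values and residuals:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    predicted = rng.normal(size=14)  # hypothetical predicted Y values
    resid = rng.normal(size=14)      # hypothetical residuals

    plt.scatter(predicted, resid)
    plt.axhline(0, color="gray", linestyle="--")
    plt.xlabel("Predicted Y")
    plt.ylabel("Residual")
    plt.show()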
Question 4:
There are three criteria we could use in determining how many PCs to retain: 1) subjective judgment; 2) the eigenvalue-1 rule; 3) the scree graph.
Here the result of using 2) is consistent with that of using 3): the first two PCs should be retained. The first two PCs’ eigenvalues are larger than 1, and in the scree graph the leveling off begins at component 3 (compared with the first two PCs, components 3 through 7 account for relatively little additional variance).
Although this judgment is always subjective, we don’t think retaining 3 PCs, as some students suggested, makes sense. After adding component 3, the total variance explained increases from 69% to 80%; but this extra 11% contribution from component 3 is much smaller than the contributions of PC1 (38%) and PC2 (31%). Moreover, if you retain 3 PCs here, you will find later on, in your rotation, that the variable LEAVES loads almost equally on PC1 and PC3, which makes interpretation difficult.
Also, straight lines should be used to connect the points in the scree graph. Some students used smooth curves, which is not conventional and also makes it harder to compare how steeply the eigenvalues drop.
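A sketch of a correctly drawn scree graph: the first three eigenvalues are back-calculated from the percentages quoted above (proportion times 7 variables), and the last four are hypothetical fillers chosen only to make the total 7:

    import matplotlib.pyplot as plt

    eigenvalues = [2.66, 2.17, 0.77, 0.50, 0.40, 0.30, 0.20]

    plt.plot(range(1, 8), eigenvalues, "o-")      # straight segments, not a smoothed curve
    plt.axhline(1, color="gray", linestyle="--")  # eigenvalue-1 reference line
    plt.xlabel("Component")
    plt.ylabel("Eigenvalue")
    plt.title("Scree graph")
    plt.show()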
Question 5:
The majority did a good job confirming the relationships among the outputs.
Question 6:
Some students are confused about how to interpret the results after the PC analysis. The method is as follows:
We decide to retain the first two PCs, perform the rotation, and obtain the rotated loading matrix (the rotated factor pattern in the JMP output). In each row (i.e., for each variable: DEPTH, WIDTH, etc. in this case), mark the loading with the highest absolute value. You can then see that VELOCITY and SAND (positively) and SILT and LEAVES (negatively) load mainly on factor 1, while DEPTH, WIDTH, and MUD load strongly on factor 2. Although not precise, we could call factor 1 a “speed & bottom” dimension and factor 2 a “volume” dimension.
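For students who want to see the rotation step outside JMP, here is a sketch using a standard varimax implementation; the loading matrix is hypothetical, shaped only to mimic the pattern described above rather than the actual JMP output:

    import numpy as np

    def varimax(L, max_iter=100, tol=1e-6):
        # Orthogonal varimax rotation of a (variables x factors) loading matrix.
        p, k = L.shape
        R = np.eye(k)
        d = 0.0
        for _ in range(max_iter):
            Lr = L @ R
            u, s, vt = np.linalg.svd(L.T @ (Lr ** 3 - Lr * (Lr ** 2).sum(axis=0) / p))
            R = u @ vt
            if d != 0 and s.sum() / d < 1 + tol:
                break
            d = s.sum()
        return L @ R

    variables = ["DEPTH", "WIDTH", "VELOCITY", "SAND", "SILT", "MUD", "LEAVES"]
    loadings = np.array([  # hypothetical unrotated loadings on the first two PCs
        [0.3, 0.8], [0.2, 0.85], [0.8, 0.1], [0.75, 0.2],
        [-0.7, 0.3], [0.1, 0.7], [-0.8, 0.1],
    ])
    rotated = varimax(loadings)
    for var, row in zip(variables, rotated):
        print(f"{var:8s} loads mainly on factor {np.argmax(np.abs(row)) + 1}")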
Factor 1 (“speed & bottom”) is uncorrelated with factor 2 (“volume”).
A stream with a high score on factor 1 is one with a high velocity, a lot of sand, and little silt and leaves on the bottom. A stream with a low score on factor 1 flows slowly and has little sand and a lot of silt and leaves on the bottom. As some students pointed out in the HW, this makes sense: if a stream flows quickly, it tends to flush away the silt and leaves easily, while sand is harder to flush and tends to stay on the bottom.
A stream with a high score on factor 2 is a deep, wide, and muddy one, and a stream with a low score on factor 2 is a shallow, narrow, and clear one. This also makes sense intuitively.