Key to HW4: Simulation
We can of course only talk about these things in generalities, since
everybody was working with different data and different models.
Here are the key points that you should have noticed.
Problem 1) Lioness data:
We know that the true model is quadratic. That is how we specified
it. When we examine the residuals of the linear model we should see
fairly clearly that the linear model systematically overpredicts the
range of young lionesses, underpredicts the range of middle-aged
lionesses, and again overpredicts the territorial range of older
lionesses. This trend in the residuals looks like an x-squared term.
When we put the x-squared term into the model the trend in the
residuals goes away.
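Here is a minimal sketch of the idea in Python with statsmodels (one
tool among many). The coefficients, ages, sample size, and noise
level are all made up for illustration, since everybody's data
differed:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 50
    age = rng.uniform(2, 14, n)                       # lioness ages
    y = 5 + 6*age - 0.4*age**2 + rng.normal(0, 3, n)  # quadratic truth

    lin = sm.OLS(y, sm.add_constant(age)).fit()
    # Mean residual by age group: negative, positive, negative, i.e.
    # the linear fit overpredicts, underpredicts, then overpredicts.
    for lo, hi in ((2, 6), (6, 10), (10, 14)):
        grp = (age >= lo) & (age < hi)
        print(lo, hi, round(lin.resid[grp].mean(), 2))

    # Adding the age^2 term makes the residual trend disappear.
    quad = sm.OLS(y, sm.add_constant(np.column_stack([age, age**2]))).fit()
    for lo, hi in ((2, 6), (6, 10), (10, 14)):
        grp = (age >= lo) & (age < hi)
        print(lo, hi, round(quad.resid[grp].mean(), 2))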
Note that with increasing sample size the standard error of the
estimate (the root mean square error), which estimates the true error
standard deviation, does not change very much, but the standard errors
of all the parameters will decrease while the test statistics increase
and become more significant. Increasing the sample size
makes it easier to distinguish the signal (the true relationship) from
the noise.
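Refitting the same made-up quadratic model at a few sample sizes makes
this easy to see:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    for n in (25, 100, 400):
        age = rng.uniform(2, 14, n)
        y = 5 + 6*age - 0.4*age**2 + rng.normal(0, 3, n)
        X = sm.add_constant(np.column_stack([age, age**2]))
        fit = sm.OLS(y, X).fit()
        # Root MSE stays near the true sigma = 3; the slope SEs shrink.
        print(n, np.sqrt(fit.mse_resid), fit.bse[1:])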
Increasing the amount of noise added to the deterministic portion of
the model will increase the root mean square error, which estimates
the noise standard deviation, and will also increase the uncertainty
about our parameters. Our confidence intervals for the parameters and for
predictions will be wider. With more noise in the data it is more
difficult to see the signal.
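Holding n fixed and turning up sigma shows the same thing in reverse,
again with the made-up quadratic model:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 100
    age = rng.uniform(2, 14, n)
    X = sm.add_constant(np.column_stack([age, age**2]))
    for sigma in (1, 3, 9):
        y = 5 + 6*age - 0.4*age**2 + rng.normal(0, sigma, n)
        fit = sm.OLS(y, X).fit()
        # Root MSE tracks sigma; the parameter SEs (and CIs) grow with it.
        print(sigma, np.sqrt(fit.mse_resid), fit.bse[1:])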
Problem 2) Your own dataset:
When you know the model it is interesting to experiment by adding
bogus predictors and seeing if you can tell that they are bogus. When
you add a bogus predictor the augmented model will explain at least as
much of the variance in your response as the true model does, but the
trade-off is that you lose degrees of freedom, and with them
statistical power. Another trade-off is that predictors that don't
belong in your model tend to make the model's predictions
overconfident. This is called overfitting.
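A small sketch with a made-up true model (y depends only on x1) shows
both effects:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 40
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)              # bogus: no relation to y
    y = 2 + 1.5*x1 + rng.normal(0, 1, n)

    true_fit = sm.OLS(y, sm.add_constant(x1)).fit()
    aug_fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(true_fit.rsquared, aug_fit.rsquared)          # augmented >= true
    print(true_fit.rsquared_adj, aug_fit.rsquared_adj)  # adjusted can drop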
Most of the time your hypothesis test on the slope of the bogus
predictor will not lead to rejection of the hypothesis that the bogus
variable has no linear effect on the response, but occasionally a
variable that doesn't belong in the model will test as a significant
predictor (a Type I error). Conversely, sometimes a variable that you
know belongs in your model will test as insignificant (a Type II
error). Your significance level and the amount of statistical power
you have determine how often these things happen.
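You can estimate both rates by simulation; here is one way, with
made-up numbers:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, reps, alpha = 40, 1000, 0.05
    real_hits = bogus_hits = 0
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)          # bogus predictor
        y = 2 + 0.5*x1 + rng.normal(0, 1, n)
        fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
        real_hits += fit.pvalues[1] < alpha
        bogus_hits += fit.pvalues[2] < alpha
    print(real_hits / reps)    # power for the real slope
    print(bogus_hits / reps)   # close to alpha = 0.05 for the bogus one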
Problem 3) Multicollinearity:
Generating your own x values as a linear function of each other (with
some noise) guarantees that your predictors will be collinear. If x2
is a linear function of x1 plus a small amount of noise then x1 and x2
will be highly collinear, and the estimated plane will teeter about
this ridge. This will be evidenced by a significant whole-model
F-test and insignificant individual t-tests. The Variance Inflation
Factors will be large too.
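For example, with made-up numbers again:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(6)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, 0.05, n)     # nearly a copy of x1
    y = 1 + 2*x1 + 2*x2 + rng.normal(0, 1, n)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()
    print(fit.f_pvalue)        # whole-model F-test: highly significant
    print(fit.pvalues[1:])     # individual t-tests: often insignificant
    print([variance_inflation_factor(X, i) for i in (1, 2)])  # huge VIFs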
When you increase the amount of noise between the predictors the
multicollinearity lessens, but it won't entirely go away. If you also
experimented with multicollinearity and large sample sizes, you may
have noticed that the variance inflation due to the multicollinearity
is counteracted by the large sample size. If you also used just one
of the xi's to predict y, then you may have noticed that the slopes
can change quite radically. This is because both variables carry much
the same information: in the model in which they both appear they
share the burden of predicting the response, but when either appears
alone it has to do double duty.
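Continuing the same made-up setup:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, 0.05, n)
    y = 1 + 2*x1 + 2*x2 + rng.normal(0, 1, n)

    joint = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    alone = sm.OLS(y, sm.add_constant(x1)).fit()
    print(joint.params[1:])    # two unstable slopes that sum to about 4
    print(alone.params[1])     # a single slope near 4: "double duty"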
General comment about statistical power:
One overall thing to note is that sample size is power - power to
reject a false null hypothesis - power to extract the signal from the
noise - power to see through the errors into the nature of the
relationships between your variables. A very important question is:
how much data do you need to be able to see the effects that you
expect? Such questions can get very difficult with even moderately
complex models. Simulation is a beautiful tool for exploring the
question of sufficient sample size.
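A crude power study for a simple made-up regression looks like this:
simulate many datasets at each sample size and count the rejections.
Raise n until the rejection rate reaches the power you want.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    alpha, reps, slope = 0.05, 1000, 0.3
    for n in (20, 50, 100, 200):
        hits = 0
        for _ in range(reps):
            x = rng.normal(size=n)
            y = slope*x + rng.normal(0, 1, n)
            fit = sm.OLS(y, sm.add_constant(x)).fit()
            hits += fit.pvalues[1] < alpha
        print(n, hits / reps)  # estimated power climbs with n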