Key to HW4: Simulation

We can of course only talk about these things in generalities, since everybody was working with different data and different models. Here are the key points that should be noticed.

Problem 1) Lioness data:

We know that the true model is quadratic; that is how we specified it. When we examine the residuals of the linear model we should see fairly clearly that the linear model systematically overpredicts the range of young lionesses, underpredicts the range of middle-aged lionesses, and again overpredicts the territorial range of older lionesses. This trend in the residuals looks like an x-squared term. When we put the x-squared term into the model, the trend in the residuals goes away.
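As a concrete illustration, here is a minimal simulation sketch of this effect. The model, coefficients, and noise level are invented for illustration, not the actual assignment values:

```python
# Hypothetical "lioness" data: territorial range is a concave quadratic
# function of age, but we fit a straight line and inspect the residuals.
import numpy as np

rng = np.random.default_rng(1)
age = np.linspace(2, 14, 60)                    # lioness age in years (invented)
true_range = 5 + 4 * age - 0.25 * age**2        # quadratic truth, peaks mid-life
y = true_range + rng.normal(0, 1.0, age.size)   # add noise

# Fit the (wrong) linear model y = b0 + b1*age by least squares.
X = np.column_stack([np.ones_like(age), age])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# The residuals run negative / positive / negative across age:
# the line overpredicts for the young, underpredicts mid-life, overpredicts the old.
thirds = np.array_split(resid, 3)
print([round(t.mean(), 2) for t in thirds])
```

Adding an age-squared column to X removes this pattern, which is exactly the diagnostic the residual plot gives you.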

Note that with increasing sample size the standard error of the estimate (the root mean square error), which estimates the true error standard deviation, does not change very much, but the standard errors of all the parameter estimates will decrease while the test statistics increase and become more significant. Increasing the sample size makes it easier to distinguish the signal (the true relationship) from the noise.
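A quick sketch of this point, with an invented model y = 2 + 3x + noise (true noise SD of 2): the RMSE hovers near the true sigma at every sample size, while the slope's standard error shrinks as n grows.

```python
# Invented simple-regression model; compare RMSE and slope SE across sample sizes.
import numpy as np

def fit_stats(n, seed, sigma=2.0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)
    y = 2 + 3 * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    rmse = np.sqrt(resid @ resid / (n - 2))                  # estimates sigma
    se_slope = rmse / np.sqrt(((x - x.mean())**2).sum())     # shrinks with n
    return rmse, se_slope

results = {n: fit_stats(n, seed=0) for n in (30, 300, 3000)}
for n, (rmse, se) in results.items():
    print(n, round(rmse, 2), round(se, 4))
```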

Increasing the amount of noise added to the deterministic portion of the model will increase the root mean square error, which estimates that noise standard deviation, and will also increase the uncertainty about our parameters. Our confidence intervals for the parameters and for predictions will be wider. With more noise in the data it is more difficult to see the signal.
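Sticking with the same invented model, doubling the noise SD roughly doubles both the RMSE and the slope's standard error (exactly, here, because the same seed reuses the same noise draws):

```python
# Invented model y = 2 + 3x + noise; vary the noise SD and watch
# the RMSE track it and the slope SE widen with it.
import numpy as np

def fit(sigma, n=200, seed=7):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)
    y = 2 + 3 * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(((y - X @ b)**2).sum() / (n - 2))
    se = rmse / np.sqrt(((x - x.mean())**2).sum())
    return rmse, se

out = {s: fit(s) for s in (1.0, 2.0, 4.0)}
for s, (rmse, se) in out.items():
    print(s, round(rmse, 3), round(se, 4))
```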

Problem 2) Your own dataset:

When you know the model it is interesting to experiment by adding bogus predictors and seeing whether you can tell that they are bogus. When you add a bogus predictor, the augmented model will explain at least as much of the variance in your response as the true model does (R-squared never decreases when a predictor is added), but one trade-off is that you lose degrees of freedom, and with them statistical power. Another trade-off is that predictors that don't belong in your model will tend to make the model's predictions overconfident. This is called overfitting.
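A small sketch of the R-squared point, with an invented true model that uses only x1 and a pure-noise bogus predictor tacked on:

```python
# The bogus predictor can never lower R^2, though the gain is usually tiny;
# adjusted R^2 (which charges for the lost degree of freedom) will often,
# but not always, go down.
import numpy as np

rng = np.random.default_rng(3)
n = 40
x1 = rng.uniform(0, 10, n)
y = 1 + 2 * x1 + rng.normal(0, 2, n)   # true model uses x1 only
bogus = rng.normal(0, 1, n)            # unrelated to y

def r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - ((y - X @ b)**2).sum() / ((y - y.mean())**2).sum()

X_true = np.column_stack([np.ones(n), x1])
X_aug  = np.column_stack([np.ones(n), x1, bogus])
r2_true, r2_aug = r2(X_true, y), r2(X_aug, y)
adj = lambda r, p: 1 - (1 - r) * (n - 1) / (n - p - 1)
print(round(r2_true, 4), round(r2_aug, 4))
print(round(adj(r2_true, 1), 4), round(adj(r2_aug, 2), 4))
```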

Most of the time your hypothesis test on the slope of the bogus predictor will not lead to rejection of the hypothesis that the bogus variable has no linear effect on the response, but occasionally a variable that doesn't belong in the model will test as a significant predictor. Conversely, sometimes a variable that you know belongs in your model will test as insignificant. Your significance level governs how often the first mistake happens, and your statistical power governs how often you avoid the second.
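You can check the first claim directly by simulation. In this sketch both x and y are pure noise, so any "significant" slope is a false positive; at alpha = 0.05 it should happen about 5% of the time (1.96 is the large-sample approximation to the t cutoff):

```python
# Monte Carlo: how often does a pure-noise predictor test significant?
import numpy as np

rng = np.random.default_rng(11)
n, reps, hits = 100, 2000, 0
for _ in range(reps):
    x = rng.normal(size=n)
    y = rng.normal(size=n)                   # y has no relationship to x
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r**2))    # t statistic for the slope
    hits += abs(t) > 1.96
rate = hits / reps
print(rate)                                  # close to alpha = 0.05
```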

Problem 3) Multicollinearity:

Generating one of your x variables as a linear function of another (with some noise) guarantees that your predictors will be collinear. If x2 is a linear function of x1 plus a small amount of noise, then x1 and x2 will be highly collinear, and the estimated regression plane will teeter about this ridge. This will be evidenced by a significant whole-model F-test together with insignificant individual t-tests. The variance inflation factors will be large too.
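A sketch of the VIF part of this, with invented predictors. With just two predictors, the VIF for each is 1/(1 - r^2), where r is their correlation:

```python
# Build x2 as x1 plus a little noise and compute the variance inflation factor.
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.3, n)        # nearly a copy of x1

r = np.corrcoef(x1, x2)[0, 1]
vif = 1 / (1 - r**2)                   # two-predictor case: VIF = 1/(1 - r^2)
print(round(r, 3), round(vif, 1))      # r near 1, VIF far above the usual cutoff of 10
```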

When you increase the amount of noise between the predictors the multicollinearity lessens, but it won't entirely go away. If you also experimented with multicollinearity and large sample sizes, then you could have noticed that the variance inflation due to the multicollinearity is counteracted by the large sample size. If you also used just one of the xi's to predict y, then you may have noticed that the slopes can change quite radically. This is because both variables carry much the same information: in the model in which they both appear they share the burden of predicting the response, but when each appears alone it has to do double duty.
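The "share the burden" effect can be seen in a few lines. In this invented setup y = x1 + x2 + noise with x2 nearly a copy of x1, so the two joint slopes are individually unstable but sum to about 2, and x1 alone picks up a slope near 2 by itself:

```python
# Joint vs single-predictor slopes under strong collinearity (invented data).
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.2, n)        # nearly a copy of x1
y = x1 + x2 + rng.normal(0, 1, n)      # each true slope is 1

def slopes(*cols):
    X = np.column_stack([np.ones(n), *cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1:]

joint = slopes(x1, x2)   # individually wobbly, but they sum to about 2
single = slopes(x1)      # alone, x1 does double duty: slope near 2
print(joint, single)
```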

General comment about statistical power:

One overall thing to note is that sample size is power - power to reject false null hypotheses - power to extract the signal from the noise - power to see through the errors into the nature of the relationships between your variables. A very important question is: how much data do you need to be able to see the effects that you expect? Such questions can get very difficult with even moderately complex models. Simulation is a beautiful tool for exploring the question of sufficient sample size.
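To make that concrete, here is a minimal power simulation for detecting an invented small slope (0.2, noise SD 1) at alpha = 0.05; estimated power climbs toward 1 as n grows:

```python
# Estimate power by simulation: fraction of replications in which the
# slope t-test rejects at the (large-sample) 1.96 cutoff.
import numpy as np

def power(n, slope=0.2, reps=1000, seed=2):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        y = slope * x + rng.normal(size=n)
        sxx = ((x - x.mean())**2).sum()
        b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
        resid = y - y.mean() - b1 * (x - x.mean())
        se = np.sqrt((resid**2).sum() / (n - 2) / sxx)
        hits += abs(b1 / se) > 1.96
    return hits / reps

powers = {n: power(n) for n in (20, 50, 200)}
for n, p in powers.items():
    print(n, p)
```

Running the same loop over a grid of n values tells you roughly how much data you need before an effect of a given size becomes reliably visible.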