HOMEWORK 5
1. Most people did a good job on this question, except for forgetting to interpret the result of the F-test for the whole model. It is routine to interpret the coefficient estimates, R2, the t-test results, and the whole-model F-test result. And especially here, since you are asked about the significance of each predictor, checking the F-test after the t-tests is necessary; furthermore, the F-test result has a special meaning in this problem.
As we see, neither of the predictors is significantly different from 0 at a = 0.05 according to the t-tests; however, in the whole-model F-test of the hypothesis H0: b1 = b2 = 0, the p-value is 0.0288 < 0.05 (taking a = 0.05), so we reject H0, which means at least one of the predictors has a significant effect on PRODUCTION. Combining the t-test and F-test results, we can conclude that there is probably multicollinearity (actually collinearity here, since we only have two predictors in the model).
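This symptom can be reproduced with made-up numbers. The sketch below uses synthetic data (not the homework's PRODUCTION data): it fits an OLS model with two nearly collinear predictors and computes the whole-model F-test and the individual t-tests by hand, so you can see insignificant t-tests coexist with a significant F-test.

```python
# Synthetic demo of the multicollinearity symptom: significant whole-model
# F-test alongside insignificant individual t-tests. All numbers made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 17
x1 = np.linspace(0, 1, n)
x2 = x1 + rng.normal(0, 0.02, n)          # nearly collinear with x1
y = 3 + 2 * (x1 + x2) + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
k = 2                                      # number of predictors
s2 = resid @ resid / (n - k - 1)
cov = s2 * np.linalg.inv(X.T @ X)          # coefficient covariance matrix
t_stats = beta / np.sqrt(np.diag(cov))
t_p = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)

ss_tot = np.sum((y - y.mean()) ** 2)
ss_res = resid @ resid
F = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
F_p = stats.f.sf(F, k, n - k - 1)
print(f"slope t p-values: {t_p[1]:.3f}, {t_p[2]:.3f}; whole-model F p-value: {F_p:.2e}")
```

Because x1 and x2 carry nearly the same information, each slope's standard error is inflated and neither t-test can single it out, yet together they clearly explain y.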
2. We think three problems are visible in the residual plots: a potential outlier, curvilinearity, and autocorrelation. The point on the right-hand side, well below the mean of the residuals, has an extremely large residual and is a candidate outlier. The other points form a U-shaped pattern about the mean of the residuals; although we only have 17 observations here, which makes this conclusion a little weak, the shape is fairly obvious if you also study the data (PRODUCTION increases as AREA increases, but at a decreasing rate). We can also see that, except for the potential outlier, the other observations clump into three groups. In JMP, if you click on the points in the first two groups on the left-hand side, you will find that the points clumping together come from consecutive years; or just go back to the data, where you will find that, from 1961 to 1965 and from 1969 to 1971, the values of AREA and PRODUCTION are exactly the same. In other words, for the observations in these two groups, given this year's PRODUCTION you can predict next year's PRODUCTION exactly; they are not independent.
Some people talked about heteroscedasticity. We don't think it is a problem here. If you ignore the two groups of observations with clumped residuals, the residuals of the other observations seem to spread out evenly; and the small sample size does not give you much confidence to talk about non-constant variance either.
Some people talked about a violation of the normality assumption. Non-normality is not obvious with a sample this small; you should do some diagnostic, e.g. a Q-Q plot, before drawing any conclusion about normality.
3. The concepts of outlier, leverage, and influential case are quite different; don't mix them up. Studentized residuals, the hat values h_ii, and Cook's D are used to identify these cases, respectively. Most of the answers to this question were good.
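To make the distinction concrete, here is a minimal sketch (tiny made-up dataset, not the homework data) computing all three case statistics by hand for a simple regression. The last point sits far out in X, so it has high leverage and, here, high influence.

```python
# Leverage h_ii flags unusual X values; studentized residuals flag outliers
# in Y; Cook's D flags influential cases (roughly, both at once).
# Dataset is invented for illustration only.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 15.])   # last point: extreme X value
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.0])

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                  # leverages
resid = y - H @ y
p = X.shape[1]                                  # parameters (incl. intercept)
n = len(y)
s2 = resid @ resid / (n - p)
stud = resid / np.sqrt(s2 * (1 - h))            # internally studentized residuals
cooks_d = stud ** 2 * h / (p * (1 - h))

for i in range(n):
    print(f"case {i}: h={h[i]:.3f}  stud resid={stud[i]:+.2f}  Cook's D={cooks_d[i]:.3f}")
```

A common rule of thumb flags h_ii > 2p/n as high leverage; the last case is far above that threshold and also dominates Cook's D, even though its raw residual is unremarkable.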
4. The Durbin-Watson test for autocorrelation gives a value of 1.0867. Since this is less than 2, it suggests positive autocorrelation in the data. To test whether this is significant, we need to use the appropriate decision rule and compare the DW statistic with the critical values for our n and number of predictors (see page 355 in the textbook). Since dl = 1.02 and du = 1.54, our DW statistic lies between these values, hence the test is inconclusive.
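For reference, the DW statistic is just a ratio computed from the residuals, DW = sum((e_t - e_{t-1})^2) / sum(e_t^2). A quick sketch with made-up residual series (not the homework's):

```python
# Durbin-Watson from its definition. Values near 2 suggest no first-order
# autocorrelation; well below 2 suggests positive autocorrelation; well
# above 2 suggests negative autocorrelation. Residuals are invented.
import numpy as np

def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# positively autocorrelated residuals: neighbors share the same sign
pos = np.array([1.2, 1.0, 0.7, 0.2, -0.3, -0.8, -1.1, -0.9, -0.4, 0.3])
# alternating residuals: negative autocorrelation
neg = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

print(f"DW (positive autocorr): {durbin_watson(pos):.3f}")   # well below 2
print(f"DW (negative autocorr): {durbin_watson(neg):.3f}")   # well above 2
```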
The plot of residuals against year shows signs of autocorrelation. There is a trend in the residuals whereby one point is closely associated with its neighbors. This pattern is especially prevalent at lower levels of predicted production.
5. The regressions of production on year alone and then on area alone indicate that each independent variable is statistically significant (the p-values are less than 0.05). In contrast, in the full model the t-tests indicate that neither of these variables is statistically significant in explaining the variance in production. This is a principal symptom of multicollinearity.
a) The matrix of correlations among the coefficient estimates gives values that are greater than 0.9 in absolute value, which indicates that multicollinearity is present. In this case, since we only have 2 independent variables, we CAN identify which variables are involved.
b) The R-squared k value obtained for the regression of year on area and area on year is the same for both models and is equal to 0.9067. This implies that 90.67% of the variation in year is shared with the other X variable (area), and vice versa. Since R-squared k = 1 implies perfect multicollinearity, a value of 0.9067 implies that there is strong multicollinearity among our X variables.
The simple correlation between area and year (which is NOT the same as the correlation among the coefficient estimates) is 0.9522. The square of 0.9522 gives you the R-squared k value. Note that this is only true when you compare the simple correlation with a SIMPLE regression (i.e. when you have only one X variable in your model).
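This identity is easy to verify with made-up numbers (not the area/year data): with exactly two X variables, R-squared k from regressing one X on the other equals the square of their simple correlation, in both directions.

```python
# Check: for two variables, R^2 of x2-on-x1 (and x1-on-x2) equals r^2.
# The x1/x2 values below are invented for illustration.
import numpy as np

x1 = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
x2 = np.array([1.1, 2.3, 2.9, 4.2, 4.8, 6.3, 7.1, 7.8])   # strongly related to x1

r = np.corrcoef(x1, x2)[0, 1]

def r_squared(x, y):
    """R^2 from a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"R^2 (x2 on x1) = {r_squared(x1, x2):.4f}")
print(f"R^2 (x1 on x2) = {r_squared(x2, x1):.4f}")
```

With three or more X variables the equivalence breaks down, because R-squared k then comes from a multiple regression on all the other X's, not a simple correlation.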
6. The big issues we addressed in questions 2-5 were residual patterns, "case statistics", autocorrelation, and multicollinearity, and we had indications that each might be a problem for this regression. For question 6, we need to specifically address how these problems affect our interpretation of the cassava data (and, ideally, what we might do about them). Instead of just saying "the assumptions of OLS regression were violated", say WHICH assumptions were violated by WHICH problems (cf. RWG: 110-113). Also, instead of lifting relevant sentences from the book, see if you can express the problem in your own words. That's the way to really learn this stuff!
First, patterns in the residual plot are clearly worth worrying about, though people picked out different aspects. In my opinion, curvilinearity and a possible influential case (RWG: 53) are the most obvious obstacles to an "all clear" residual plot. Curvilinearity implies that we are using a straight line to model a more complicated relationship. We should really plot residuals against individual predictor variables (instead of Y-hat) to see whether a quadratic term would help the model, for instance. This is a case of using the wrong model, which is quite different from the problem of the outlying point. We'll get to that point in the next paragraph. Note that neither of these problems implies anything about the NORMALITY of the residuals. To check that, we'd want to do a Q-Q plot, but, with 14 data points, we have pretty sparse data for detecting such a pattern with certainty.
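The "try a quadratic term" suggestion above can be sketched with made-up curved data (not the cassava data): fit a straight line, then add an x-squared term, and compare the fits.

```python
# Checking curvilinearity by adding a quadratic term. The data below are
# simulated to increase at a decreasing rate (like PRODUCTION vs AREA),
# purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 17, 17)
y = 10 * np.sqrt(x) + rng.normal(0, 0.5, x.size)   # curved relationship

def fit_r2(X, y):
    """R^2 of an OLS fit with design matrix X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

lin = fit_r2(np.column_stack([np.ones_like(x), x]), y)
quad = fit_r2(np.column_stack([np.ones_like(x), x, x ** 2]), y)
print(f"linear R^2 = {lin:.4f}, quadratic R^2 = {quad:.4f}")
```

The real diagnostic is the residual plot, not the R^2 bump alone (R^2 never decreases when you add a term); with a curved relationship the straight-line residuals show the U-shape, while the quadratic fit's residuals look patternless.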
The case statistics highlighted two potential concerns. The high leverage point is worth noting, because it indicates that the strength of the relationship we are describing is highly sensitive to the value of this point. The influential point is more worrisome. LOOK AT THE RAW DATA: Why is this point so weird? My guess is that there was some factor other than cleared land and year (maybe drought?) that influenced cassava production that year. Since we don't really know, we can only suggest that running the analysis again without this point might give different results.
Autocorrelation is the big problem, in my opinion. The D-W test was inconclusive, but LOOK AT THE DATA: Clearly the data from 1961-1971 are not independent of each other, and they're probably not independent after that. Non-independent residuals are a big no-no for OLS regression, because you can't trust the usual parametric hypothesis tests (t-tests, F-tests, CIs, etc.). One way to deal with autocorrelation is to use the autoregressive approach we used in the stock market question.
Multicollinearity is obviously also a big deal in this regression. It means that our estimates of the betas are unreliable due to inflated SEs. (Interestingly, the estimated Y-hats aren't affected by multicollinearity, so you could interpret those reliably.) People suggested that including two variables that explain the same variance is unnecessary, and that we could just drop one. That sounds reasonable to me.
OVERALL: We have seen that we can't accept the results of this regression at face value, BUT I'd say we have learned a lot about cassava production by examining the data in a regression context. We suspect that there are some issues in data collection (what WAS going on from 1964-1971??). We see that cultivation generally increases over time and that this generally results in more production, though there are some interesting exceptions (what's up with 1982?). We suspect that the relationship might not be best fit by a straight line. SO, even though we can't trust this regression, we see that a formal statistical analysis can be a useful way of exploring our data.
STOCK MARKET PROBLEM
People generally did a nice job observing and describing the autocorrelation in this data: In both the scatterplots and the correlation matrix, the first order lags are highly correlated with the data (r = 0.81) and this strong autocorrelation falls off at higher order lags. As an aside: It's good to have an idea of what correlations of .8, .6, .4, etc. look like, so get an idea of what "highly correlated" means by looking at your scatterplots.
Adding the t-1 lag of DJIAclose to the regression is a way of explicitly modeling first-order autocorrelation in your regression model. There are two important things to notice about the results of this analysis. First, to check whether there is still autocorrelation in the model, it is not enough to just plot (or get the correlation for) the residuals vs. the first-order lag of those residuals. Remember that you also had higher-order autocorrelation in the original regression, so you need to check whether that went away too. If you plot the residuals vs. lag 1, lag 2, lag 3, etc., as you did the first time, you'll notice that the autocorrelation has, in fact, disappeared at all the lags. But how did adding only the first-order lag of DJIAclose remove the higher-order autocorrelation? The answer is that all the autocorrelation you saw decaying at higher-order lags in the first part was really just an artifact of first-order autocorrelation. In other words, the strong correlation between today's DJIAclose and yesterday's has the indirect effect of creating a somewhat weaker correlation between today's DJIAclose and the day before yesterday's (through the mechanism that yesterday's DJIAclose is correlated with the day before yesterday's as a first-order lag). This is hard to say, but it makes sense. This is the answer to George's question: what happened to the autocorrelation?
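This artifact argument is easy to check by simulation (the series below is simulated, not the DJIA data): in an AR(1) process with coefficient rho, the lag-k correlation comes out to roughly rho^k, so the higher-order correlations decay geometrically even though only lag 1 drives the process.

```python
# Simulate an AR(1) series and show lag-k autocorrelation ~ rho^k.
# Synthetic data for illustration; rho = 0.8 is an arbitrary choice
# in the ballpark of the r = 0.81 mentioned above.
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8
n = 5000
y = np.zeros(n)
for t in range(1, n):
    y[t] = rho * y[t - 1] + rng.normal()   # only lag 1 enters the process

def lag_corr(x, k):
    """Sample correlation between the series and its k-th lag."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

for k in (1, 2, 3):
    print(f"lag {k}: corr = {lag_corr(y, k):.3f}  (rho^{k} = {rho ** k:.3f})")
```

So once the lag-1 term is in the regression, nothing is left for the higher lags to explain, which is exactly what the residual checks showed.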
Second, think about what the new regression means. Many people were happy that the autocorrelation disappeared and that the R^2 improved, but they didn't notice that the "day" term in the model was no longer significant. If our idea is to predict the stock market, we have learned that once you take out the first-order autocorrelation in the model, you can't predict the long-term trend in the market with this analysis.