The February 5th lab will focus on questions and discussion related to your written assignment.
Some things to consider:
- The exploratory data analysis section is meant to give you the opportunity to discuss features of the data relevant to the regression analysis. You should discuss in some way the center and spread of the X and Y variables. You do not need to discuss every statistic; just highlight, using statistics, the features of these variables that are important to the subject matter and the research question. I suggest making histograms of the X and of the Y (not to be included in the writeup) and looking at them. Then use relevant summary stats to describe features that you see. X and Y alone are not of interest, however. Most importantly, you should use the scatterplot of the data to describe how Y might be associated with X. Unusual observations or outliers should be noted; be sure to define what you mean when you say something is an outlier -- does it appear on the scatterplot as a possibly unusual observation? If you call something an outlier, make sure you describe the context in which it looks unusual.
- You should not need to directly copy the language I used in my sample one-page summary. Your summary should be tailored to your particular dataset.
- For residual plots, there are two to make: (1) a normal probability plot and (2) a residuals vs. fitted plot. I don't expect you to do anything more than make the two plots and comment on them very briefly (two sentences at most on the residual plots in total).
- In (1), you are looking for observations to fall on roughly a straight line; this provides evidence that your assumption of normality of the population of Y about mu{Y|X} is probably reasonable. Remember that when points fall off the line, it may be just an artifact of sampling variability, we may have a distribution with longer or shorter tails than the normal, or we may have some unusual observations worth checking into. Without further information about the problem, and without a large sample size, it can be difficult to attribute these types of observations to anything other than sampling variability. Problems with these plots come from systematic departures from the line or from points that are outliers (usually Splus will number them); you'll see more of this on Thursday.
- In (2), please omit the "smooth" line when you make your residuals vs. fitted plot. As you move across values of X, you are looking for your observations to be evenly spread in a band above and below 0. Remember, we assume the errors of the model have mean 0 and equal variance. In my sample writeup, 2 points stood out in this plot, and I decided to check how influential they were on the regression line.
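If it helps to see what goes into plots (1) and (2), here is a minimal sketch in Python rather than Splus, using only the standard library. It computes the points each plot would display and shows one simple way to check influence, by refitting without a single observation. The data, the function names, and the (i - 0.5)/n plotting positions are all assumptions made for illustration; they are not taken from the sample writeup, and Splus may use a different convention.

```python
from statistics import NormalDist, mean

def fit_line(x, y):
    """Least-squares (intercept, slope) for a simple regression of y on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residual_diagnostics(x, y):
    """Return the points behind plot (1) and plot (2)."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    n = len(resid)
    # Plot (1): sorted residuals against standard normal quantiles.
    # The (i - 0.5) / n plotting positions are one common convention (assumed here).
    normal_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    qq_points = list(zip(normal_q, sorted(resid)))
    # Plot (2): residuals against fitted values, with no smooth line.
    rvf_points = list(zip(fitted, resid))
    return qq_points, rvf_points

def slope_change_when_dropped(x, y, i):
    """Change in the fitted slope when observation i is left out."""
    _, full_slope = fit_line(x, y)
    _, reduced_slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return reduced_slope - full_slope

# Made-up illustration data, not from any course dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
qq_points, rvf_points = residual_diagnostics(x, y)
slope_changes = [slope_change_when_dropped(x, y, i) for i in range(len(x))]
```

Points in `qq_points` lying near a straight line correspond to what you hope to see in (1); points in `rvf_points` scattered evenly about 0 correspond to (2). A large entry in `slope_changes` suggests that observation pulls noticeably on the regression line.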
For this assignment, I don't expect you to transform variables. Just do your best to report on your data analysis (a simple regression of Y on X) in a concise manner.
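For the exploratory section, the center and spread of X and Y can be summarized with a handful of statistics before you turn to the scatterplot. The sketch below is in Python rather than Splus and uses only the standard library; the data and the function names `describe` and `correlation` are made up for illustration. The correlation is included only as a rough numerical companion to the scatterplot, not as a replacement for looking at it.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    """A few summaries of center and spread for one variable."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),   # sample standard deviation
        "iqr": q3 - q1,        # interquartile range
        "min": min(values),
        "max": max(values),
    }

def correlation(x, y):
    """Sample correlation between x and y, computed from its definition."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up illustration data, not from any course dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x_summary = describe(x)
y_summary = describe(y)
r = correlation(x, y)
```

In a writeup you would report only the summaries that matter for your research question, for example a mean and standard deviation for a roughly symmetric variable, or a median and interquartile range for a skewed one.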
Last modified: Tue Feb 4 23:40:11 EST 2003