Solutions to 10 September lab assignment.
- Histograms of alligator length and weight may vary. Both should show that the distributions are right-skewed.

- The mean and median alligator length are, respectively, 84.96 inches and 85 inches. The mean and median alligator weight are, respectively, 108.2 pounds and 80 pounds. (It is generally the case that the mean is higher than the median in a right-skewed distribution, since the median is more robust, meaning it better resists the influence of the high-end outliers.)
- The four graphs below show, on the left, the regression model and residual plot for weight versus length, and on the right, the regression model and residual plot for log(weight) versus log(length). The r-squared values for these two models are, respectively, 0.8361 and 0.9449.
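
The comparison above, a straight-line fit on the raw scale against one on the log-log scale, can be sketched in a few lines of Python. This is only an illustration: the data are made up (an assumed cubic relationship with a little fixed scatter), not the lab's alligator measurements, and the helper name fit_line is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope*x; also returns r-squared."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return slope, intercept, r_squared

# Hypothetical data, NOT the lab's measurements: weights follow an assumed
# cube law in length, multiplied by a fixed scatter factor near 1.
lengths = [60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 145]
scatter = [1.08, 0.94, 1.03, 0.97, 1.10, 0.92, 1.05, 0.96, 1.02, 0.95, 1.07, 0.98, 1.01]
weights = [1e-4 * L ** 3 * s for L, s in zip(lengths, scatter)]

# Model 1: weight against length on the raw scale.
slope_linear, intercept_linear, r2_linear = fit_line(lengths, weights)

# Model 2: log(weight) against log(length).
log_lengths = [math.log(L) for L in lengths]
log_weights = [math.log(W) for W in weights]
slope_loglog, intercept_loglog, r2_loglog = fit_line(log_lengths, log_weights)

print(f"raw scale: r-squared = {r2_linear:.4f}")
print(f"log-log:   r-squared = {r2_loglog:.4f}, slope = {slope_loglog:.3f}")
```

For data that really follow a power law, the log-log fit should give the higher r-squared, and its slope estimates the exponent. As the notes go on to say, r-squared alone should not settle the question; the residual plots still need to be examined.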

- The linear regression on the logs of length and weight is better than that on length and weight. We know this partly because r-squared is higher, but that should not be trusted too much (see comment below). A better indicator is that the residuals in the log-log plot show much less of a pattern.
- Why should we not trust r-squared? It can be very misleading. Consider the following three examples.
- First imagine a scatterplot in which the data appear almost perfectly linear, very beautifully so, but with a slight curvature in the plot. The r-squared value would be very close to 1.00 because the data were so linear. But a residual plot would dramatically show the curvature that had not been captured by the linear model. In this case, the linear model would not be bad for predicting y from x, but one wouldn't want to stop with a linear model if one was trying to understand a phenomenon. The curvature in the residuals would indicate that something more was going on.
- Now imagine a scatterplot in which all but one data point cluster in a structureless blob, with a single data point far, far away from the rest in both x and y: an extreme outlier. In that case, the r-squared value would be very close to 1.00, even though the bulk of the data show no correlation whatsoever! This is because r-squared is very sensitive to outliers. It is not robust. So don't ever stop with r-squared close to 1.00 and say, "Gee, I've got a great model!" You have to look at the data and see what story they're telling!
- Finally, imagine a case in which the data form a blobby ellipse with clear correlation but a lot of variability. In that case, the r-squared value would be substantially lower than 1.00, perhaps around 0.6 or 0.7. But the residual plot would likely show no clear pattern, so we'd know that while a linear model wasn't a terrific predictor of y from x, it might nevertheless be the best possible model for the given data. Moral: high r-squared doesn't always mean a great model, and low r-squared doesn't always mean a poor model.
- Here's another comment about judging a model. It is not the case in the alligator problem that smaller residuals in the second model indicate a better model. Think about it: the residuals of the first model are in units of weight, while those of the second are in units of log(weight), and the two are not comparable. Saying that the log-log plot has smaller residuals is like saying that a kilometer is smaller than an hour.
- If you used the natural logarithm, which in Matlab is the function log( ), then your preliminary model would be log(W)=-10.1746+3.286log(L). If you used the base 10 logarithm, which in Matlab is the function log10( ), then your preliminary model would be log10(W)=-4.4188+3.286log10(L). In either case, when you solve for W and simplify, you get W=(3.8E-5)*L^(3.286), where E indicates scientific notation. (Incidentally, the power of about 3 is typical of models relating a linear measurement like length to a volumetric measurement, to which weight would be approximately proportional. It should not be hard to see why.) Kudos to the handful of students who went the extra mile in their lab reports and showed a scatterplot of the original data along with this power function model.
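
The step from either logarithmic model to the power function can be checked numerically. The sketch below is in Python rather than Matlab, uses only the coefficients quoted above, and assumes length in inches and weight in pounds, as in the summary statistics. The function name predicted_weight is hypothetical.

```python
import math

# Coefficients of the two preliminary models quoted above.
slope = 3.286
intercept_ln = -10.1746      # model fitted with the natural logarithm
intercept_log10 = -4.4188    # model fitted with the base-10 logarithm

# Undoing the logarithm turns log(W) = a + b*log(L) into W = c * L**b,
# where c = e**a for the natural log and c = 10**a for base 10.
c_from_ln = math.exp(intercept_ln)
c_from_log10 = 10 ** intercept_log10

def predicted_weight(length, c=c_from_ln, b=slope):
    """Power-function prediction W = c * L**b.

    Units are an assumption: length in inches, weight in pounds.
    """
    return c * length ** b

print(f"c from natural-log model: {c_from_ln:.2e}")
print(f"c from base-10 model:     {c_from_log10:.2e}")
print(f"predicted weight at L = 85: {predicted_weight(85):.1f}")
```

Both intercepts lead to the same coefficient up to rounding, which is why the choice of logarithm base does not affect the final power function.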