This lab analyzes a data sets from Weisberg's book
Applied Linear Regression.
The data set contains the average brain weight and body weight
for 62 species of mammals. Body weight is in kilograms; brain weight
is in grams. Begin by entering them into S-Plus.
body.wt <- c ( 3.385, 0.480, 1.350, 465.000, 36.330, 27.660, 14.830, 1.040,
4.190, 0.425, 0.101, 0.920, 1.000, 0.005, 0.060, 3.500, 2.000, 1.700,
2547.000, 0.023, 187.100, 521.000, 0.785, 10.000, 3.300, 0.200, 1.410,
529.000, 207.000, 85.000, 0.750, 62.000, 6654.000, 3.500, 6.800, 35.000,
4.050, 0.120, 0.023, 0.010, 1.400, 250.010, 2.500, 55.500, 100.000, 52.160,
10.550, 0.550, 60.000, 3.600, 4.288, 0.280, 0.075, 0.122, 0.048, 192.000,
3.000, 160.000, 0.900, 1.620, 0.104, 4.235 )
brain.wt <- c ( 44.500, 15.499, 8.100, 423.000, 119.500, 115.000, 98.200,
5.500, 58.000, 6.400, 4.000, 5.700, 6.600, 0.140, 1.000, 10.800,
12.300, 6.300, 4603.000, 0.300, 419.000, 655.000, 3.500, 115.000,
25.600, 5.000, 17.500, 680.000, 406.000, 325.000, 12.300, 1320.000,
5712.000, 3.900, 179.000, 56.000, 17.000, 1.000, 0.400, 0.250,
12.500, 489.997, 12.100, 175.000, 157.000, 440.000, 179.500, 2.400,
81.000, 21.000, 39.200, 1.900, 1.200, 3.000, 0.330, 180.000,
25.000, 169.000, 2.600, 11.400, 2.500, 50.400 )
names <- c ( "Arctic fox", "Owl monkey", "Mountain beaver", "Cow",
"Gray wolf", "Goat", "Roe deer", "Guinea pig", "Vervet", "Chinchilla",
"Ground squirrel", "Arctic ground squirrel","African giant pouched rat",
"Lesser short-tailed shrew", "Star-nosed mole", "Nine-banded armadillo",
"Tree hyrax", "N. American opossum", "Asian elephant", "Big brown bat",
"Donkey", "Horse", "European hedgehog", "Patas mondey", "Cat", "Galago",
"Genet", "Giraffe", "Gorilla", "Gray seal", "Rock hyrax", "Human",
"African elephant", "Water opossum", "Rhesus mondey", "Kangaroo",
"Yellow-bellied marmot", "Golden hamster", "Mouse", "Little brown bat",
"Slow loris", "Okapi", "Rabbit", "Sheep", "Jaguar", "Chimpanzee",
"Baboon", "Desert hedgehog", "Giant armadillo", "Rock hyrax", "Raccoon",
"Rat", "Eastern American mole", "Mole rat", "Musk shrew", "Pig",
"Echidna", "Brazilian tapir", "Tenrec", "Phalanger", "Tree shrew",
"Red fox" )
Plot brain weight against body weight.
plot ( body.wt, brain.wt, xlab="Body weight (kg)", ylab="Brain weight (g)" )
It is difficult to see a relationship between brain weight and body weight because most of the points are clustered in the lower left corner. And that's because the scale needs to be large enough to accomodate the two points near the top. We can find out what those two species are with the identify(x,y,l) command. x,y give the locations of the points; l gives the labels. Click the left mouse button on your plot to identify a point. When you're done, click a different mouse button (3-button mouse) or both mouse buttons (2-button mouse) to get out of identify mode. (Which button to click varies from computer to computer. For me it's the middle button. Try and find what works for you.)
identify ( body.wt, brain.wt, names )
Of course they're the elephants! We should have guessed that.
The elephants, by their size, are obscuring the relationship between brain weight and body weight in the other species. To see the other species more clearly we could replot, omitting the elephants. To do that, we'd like to know what positions in our list are occupied by the elephants. We could look, or we could use the identify() command again, this time printing the species number instead of species name. Click on the elephants to find their numbers.
identify ( body.wt, brain.wt, 1:62 )
The Asian elephant is 19; the African elephant is 33. To plot the animals and omit the elephants, we use their subscripts with negative numbers.
plot ( body.wt[-c(19,33)], brain.wt[-c(19,33)], xlab="Body weight (kg)", ylab="Brain weight (g)" )
That helped some; but there is still a cluster of points in the lower left corner. Another approach to seeing the whole data set more clearly is to take logs and then replot.
body.log <- log ( body.wt )
brain.log <- log ( brain.wt )
plot ( body.log, brain.log, xlab="log (Body weight)",
ylab="log (Brain weight)" )
That looks much clearer! You might want to use the identify() command again. Remember to modify it to use logs instead of the original weights.
Instead of taking logs ourselves, we could have asked S-Plus to plot the original data on log scales. In case you're interested, here's how. The part log="xy" tells S-Plus to put both the x and y axes on log scales.
plot ( body.wt, brain.wt, xlab="Body weight (kg)", ylab="Brain weight (g)", log="xy" )
Is it scientifically legitimate to take logs? How do we know when or whether to take logs or do some other manipulation to the data? The answer is that we manipulate to help us discover relationships in the data. In this example taking logs helped us see a strong relationship that we might otherwise have missed. It's perfectly legitimate to look at your data in as many ways as possible to see what you can discover.
Because there's a strong linear relationship in the data I want to plot a line on our scatter diagram. Begin by recreating the plot of logged data.
plot ( body.log, brain.log, xlab="log (Body weight)", ylab="log (Brain weight)" )
Now let's find the SD line. We need to know the averages and SD's of x and y.
x.ave _ mean(body.log)
y.ave _ mean(brain.log)
x.sd _ sqrt ( mean ( (body.log-x.ave)^2) )
y.sd _ sqrt ( mean ( (brain.log-y.ave)^2) )
The SD line goes through the point (x.ave,y.ave) and has slope y.sd/x.sd. S-Plus can plot a line if we tell it the slope and intercept. We know the slope and just need to find the intercept using the formula y.ave - slope*x.ave. Then we use the abline() command to plot it.
slope _ y.sd/x.sd
intercept _ y.ave - slope*x.ave
abline ( intercept, slope )
Species that are far above the line have large brains relative to their bodies. Species far below the line have small brains relative to their bodies. What species are these? Use the identify command to find out. Are you pleased with where humans are?
identify ( body.log, brain.log, names )
Now let's play some games to see how the correlation coefficient can change when you look at different parts of the data set. First we get the correlation coefficient for the whole data set.
cor(body.log,brain.log)
What if he hadn't taken logs?
cor(body.wt,brain.wt)
The correlation is very high, but doesn't mean much because the original points don't lie clustered around a line. They are more like one big cluster in the lower left and two isolated pachyderm points. Let's stick with the log scale where the correlation better represents the data. FPP say that if we restrict attention to the species with body.log in a narrow range, we should get a smaller r. To see how much smaller, we'll divide the species into three groups according to whether body.log is (i) less than -2, (ii) between -2 and 2, and (iii) greater than 2. Then we'll compute r separately for each group. In addition to cutting and pasting the following commands you could look at them closely to see whether you understand them. Could you do something similar on your own if you had to?
cor(body.log[body.log < -2],brain.log[body.log < -2])
cor ( body.log [ body.log > -2 & body.log < 2 ],
brain.log [ body.log > -2 & body.log < 2 ] )
cor(body.log[body.log>2],brain.log[body.log>2])
All these correlations are fairly high. That's because even within each subgroup the scatter above and below the line is small compared to how much the line goes up and down. But if we restrict the range of body.log even further, then the correlation goes down further. For example:
cor ( body.log [ body.log > -1 & body.log < 1 ],
brain.log [ body.log > -1 & body.log < 1 ] )
What happens if we pick a sample of points that are spread evenly through the full range of body.log? Use the sample() command to choose a sample. I will choose a sample of size 30 from our 62 species and compute the correlation between body.log and brain.log just for those 30 species.
mysamp _ sample ( 1:62, 30 )
cor ( body.log[mysamp], brain.log[mysamp] )
Repeat the previous two commands several times. You will get a different sample each time and a different correlation coefficient. Notice how much or how little they vary and how close or how far they are from the correlation we got using the whole data set.
Finally, what would happen if we used just the extreme high and low values of body.log? Try it. The vertical bar "|" is the S-Plus symbol for "or".
cor ( body.log [ body.log < -3 | body.log > 7 ],
brain.log [ body.log < -3 | body.log > 7 ] )
Can you see on the scatter diagram which points we are using? Do you see what happens to the correlation coefficient?