We will analyze data on the heights in inches of the singers in the New York Choral Society in 1979. The data are grouped according to eight voice parts; the vocal range of each voice part decreases in pitch in the order in which the voice parts are given in the table. Soprano 1 is highest, soprano 2 is the next highest, and so forth. The four highest voice parts are female voices and the four lowest parts are male voices. We are trying to see the relationship between voice part and height and get a feel for how strong the relationship is.
First, read in the data:
sop1 <- c ( 64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65, 68, 65,
63, 65, 62, 65, 66, 62, 65, 63, 65, 66, 65, 62, 65, 66, 65, 61,
65, 66, 65, 62 )
Instead of typing all those numbers, you can cut and paste
sop2 <- c ( 63, 67, 60, 67, 66, 62, 65, 62, 61, 62, 66, 60, 65, 65, 61, 64,
68, 64, 63, 62, 64, 62, 64, 65, 60, 65, 70, 63, 67, 66 )
alto1 <- c ( 65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62, 61, 66, 64, 60,
61, 66, 66, 66, 62, 70, 65, 64, 63, 65, 69, 61, 66, 65, 61 )
alto2 <- c ( 70, 65, 65, 65, 64, 66, 64, 70, 63, 70, 64, 63, 67, 65, 63,
66, 66, 64, 64, 70, 70, 66, 66, 66, 69, 67, 65 )
ten1 <- c ( 69, 72, 71, 66, 76, 74, 71, 66, 68, 67, 70, 65, 72, 70, 68, 64,
73, 66, 68, 67, 64, 63, 64, 67, 66, 68 )
ten2 <- c ( 68, 73, 69, 71, 69, 76, 71, 69, 71, 66, 69, 71, 71, 71, 69, 70,
69, 68, 70, 68, 69 )
bass1 <- c ( 72, 70, 72, 69, 73, 71, 72, 68, 68, 71, 66, 68, 71, 73, 73, 70,
68, 70, 75, 68, 71 )
bass2 <- c ( 72, 75, 67, 75, 74, 72, 72, 74, 72, 72, 74, 70, 66, 68, 75, 68,
70, 72, 67, 70, 70 )
# Make histograms of heights, one histogram for each voice
par ( mfrow=c(4,2) )
hist(sop1)
hist(sop2)
hist(alto1)
hist(alto2)
hist(ten1)
hist(ten2)
hist(bass1)
hist(bass2)
# It's hard to compare those histograms because the x-axes are different.
# We need the x-axis to run from about 58 to 76; so try this:
hist(sop1,xlim=c(58,76))
hist(sop2,xlim=c(58,76))
hist(alto1,xlim=c(58,76))
hist(alto2,xlim=c(58,76))
hist(ten1,xlim=c(58,76))
hist(ten2,xlim=c(58,76))
hist(bass1,xlim=c(58,76))
hist(bass2,xlim=c(58,76))
# The different sized bins can be distracting, so let's fix that too:
hist(sop1,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(sop2,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(alto1,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(alto2,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(ten1,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(ten2,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(bass1,xlim=c(58,76),breaks=seq(58,76,by=2))
hist(bass2,xlim=c(58,76),breaks=seq(58,76,by=2))
We could compare the histograms better if we could line them up in one
column instead of two. Try it. Begin with
par ( mfrow=c(8,1) )
and then make all the histograms again. Is that better or worse?
Each histogram takes up so much space that comparing a lot of them simultaneously is difficult. Another way to compare groups visually is with boxplots, which take less space than histograms and are useful when there are too many groups. Try this:
par ( mfrow=c(1,1) )
boxplot ( sop1, sop2, alto1, alto2, ten1, ten2, bass1, bass2,
names=c("s1", "s2", "a1", "a2", "t1", "t2", "b1", "b2") )
Each box shows the middle 50% of the data. The line in the middle of each box is the median. Most of the rest of the data are contained inside the "whiskers" but there is one unusually tall tenor2 shown by an outlying line.
From visual inspection, what is the relationship between vocal
part and voice? Which presentation shows the relationship better:
histograms or boxplots? Just considering the women (sopranos and altos),
is there a relationship between vocal part and voice? What if you
just consider the men? Will it help to compute the average height
for each group? Try this (mean is another word for average):
mean(sop1)
Compute the means of the other seven groups. Are they ordered the way you thought they would be?
Some groups seem to be more spread out than others. For example,
the first sopranos and second altos seem to have less spread than
the second sopranos and first altos. Let's compute the SD of
each group.
# Compute the mean height of the first sopranos and give it a name
sop1.mean <- mean ( sop1 )
# Now compute the SD of the first sopranos and give it a name
sop1.SD <- sqrt ( mean ( (sop1-sop1.mean)^2 ) )
If you want to see the SD, just type:
sop1.SD
Which of the eight groups do you think will have small SD's? Compute and print the other seven SD's to check your guess.
Let's finish with two more items. First, we can enhance the histograms by adding the mean and +/- 1SD to each plot. And second, we can check the rule of thumb about percentages within one SD of the average.
1.
par ( mfrow=c(4,2) )
hist(sop1,xlim=c(58,76),breaks=seq(58,76,by=2),style.bar="old")
points ( sop1.mean, 0, cex=2 )
points ( sop1.mean-sop1.SD, 0, cex=2 )
points ( sop1.mean+sop1.SD, 0, cex=2 )
Note 1: style.bar="old" draws the histogram without shading. That lets
us see the points more clearly.
Note 2: The "points" function adds points to an existing plot.
points(x,y) adds a point at coordinates x, y; cex=2
tells S-Plus to draw the point twice as big as normal.
Try drawing all eight histograms with the added points.
2.
To find out what percentage of first sopranos are within +/- 1 SD of their mean
we begin by setting the upper and lower limits:
upper <- sop1.mean + sop1.SD
lower <- sop1.mean - sop1.SD
Then we'll say that each soprano is "OK" if she falls between the limits:
OK <- ( sop1 > lower ) & ( sop1 < upper )
OK
In S-Plus, T(rue) and F(alse) can be treated like 1 and 0; so we can find the number of sopranos inside the limits by adding, and then convert to a percentage.
x <- sum(OK)
(x/length(sop1))*100
The answer is just over 69%. Try this for the other voice parts. Also check what percentage are within +/- 2 SD's of the average.
Quit S-Plus
Duke PC labs: Click the "x" at the upper right corner of the main
S-Plus window
Unix: type "q()" at the prompt
Last updated Sept. 8, 1999 by Michael Lavine