Lab 2

Lab 2 Objectives:

Here we will learn how to use S-Plus to compute numerical summaries of our data. In particular, we will reproduce the results in Chapter 3, Section 3.5, Further Applications, of the POB text. The energetic student can also analyze 2 data sets from the Chapter 3 exercises. If you forget how to import data into S-Plus, or how to make/open a workspace, review lab 1 to refresh your memory.

Download the heart rate data shown in Table 3.5 of Section 3.5 of POB (or copy it from your CD). Import this into S-Plus (see lab 1!). We will assume the name of the data set is heart.rates.

We'll use the Commands Window to do some calculations. On the pull-down menus choose Window>Commands Window so that there is a check mark next to Commands Window. Alternatively, press the Commands Window button on the toolbar.

You should then see the "Commands" window on your screen. Notice the command prompt in S-Plus is ">". Type (and press return!)

> attach(heart.rates)

(don't type the ">"). This allows us to refer to the 2 variables (columns) in the heart.rates data.frame (a.k.a. "data sheet" when using the GUI) without having to also type the name of the data.frame. Now calculate the mean heart rate:

> mean(rate)

There appears to be an outlier in the data set; 40 is much lower than the rest of the data (just look at the data set in the data sheet or create a box plot). Let's calculate the mean without the "outlier".

> mean(rate[-7])

The [-7] is an index saying "don't include the 7th data value" (i.e. the 40).
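Negative indexing can be tried on a small vector first. The sketch below uses made-up values (not the POB heart rates) and a hypothetical name x. It is written in the S language and should also run in R. Prompts are omitted, so each line can be typed as is.

```r
# Hypothetical values for illustration only (not the POB data)
x <- c(70, 72, 68, 40, 75)

x[-4]         # every value except the 4th one (the 40)
mean(x)       # 65
mean(x[-4])   # 71.25
```

A negative index drops a value by position, so you need to know where the outlier sits in the vector.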

Let's do the same thing using the median:

> median(rate)
> median(rate[-7])

Remember that the median is much less sensitive to outliers than the mean. The median changes when we remove the outlier, but not as much as the mean does. Also, if the outlier were even smaller, the mean would be pulled down even further, but the median would stay the same.
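To see this with numbers, here is a small sketch using made-up values (the names x and x.low are hypothetical, and this is not the POB data). It is written in the S language and should also run in R.

```r
# Hypothetical values: one low outlier (40)
x <- c(70, 72, 68, 40, 75)
mean(x)         # 65
median(x)       # 70

# Removing the outlier moves the mean a lot, the median a little
mean(x[-4])     # 71.25
median(x[-4])   # 71

# Make the outlier even smaller: the mean drops, the median stays put
x.low <- c(70, 72, 68, 4, 75)
mean(x.low)     # 57.8
median(x.low)   # 70
```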

Now let's calculate some measures of dispersion. First, try the range and interquartile range (IQR):

> range(rate)
> diff(quantile(rate, prob=c(.25, .75)))

Notice that S-Plus calls percentiles "quantiles". Technically, range() doesn't give the range as a single number, but the minimum and maximum. (Try diff(range(rate)).) Also, you may notice a strange label (75%) on the IQR result. Ignore this; S-Plus tries to use labels that make sense, but sometimes the results are not useful. You'll notice that we get a different answer than in POB. That's because S-Plus uses a different definition of percentiles than our book does. Play with the quantile() function. Ask your TA about the "prob=c(.25, .75)" argument if you're curious.
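The sketch below repeats these steps on made-up values (x, q and iqr are hypothetical names; this is not the POB data). It assumes as.vector() drops the "75%" label in S-Plus the way it does in R. No quantile values are shown, since they depend on the percentile definition your software uses.

```r
# Hypothetical values for illustration only
x <- c(70, 72, 68, 40, 75)

range(x)         # minimum and maximum: 40 75
diff(range(x))   # the range as a single number: 35

# First and third quartiles, then their difference (the IQR)
q <- quantile(x, probs = c(.25, .75))
q
iqr <- as.vector(diff(q))   # as.vector() strips the "75%" label
iqr
```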

Quick, calculate the variance and standard deviation!

> var(rate)
> sqrt(var(rate))
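As a check on what var() computes, the sample variance can be built from its definition: the sum of squared deviations from the mean, divided by n - 1. The sketch below uses made-up values and hypothetical names (x, n, v). It is written in the S language and should also run in R.

```r
# Hypothetical values for illustration only
x <- c(70, 72, 68, 40, 75)
n <- length(x)

# Sample variance from the definition
v <- sum((x - mean(x))^2) / (n - 1)
v          # 202
var(x)     # should match v

# Standard deviation is the square root of the variance
sqrt(v)
```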

Notice we can get most of these numerical summaries at once using the pull-down menu Statistics>Data Summaries>Summary Statistics.... Try it! Would you use the mean and standard deviation or the quartiles to summarize this data set?

Now, download the birth weights.xls file and import it into S-Plus. You may already have the data in S-Plus from lab 1. We'll calculate the grouped mean and grouped variance as on pages 57 and 58 of POB. We'll assume the name of the data set is birth.weights.

First, we need to add a column to the data set. With the pull-down menu do Insert>Column.... Type midpoint into the Name(s) field of the Insert Columns dialog box. Leave everything else as is. Click OK. Enter the midpoints of the classes into the midpoint column as in POB page 57: 249.5, 749.5, ..., 5249.5. Now go to the Commands Window (see above on how to get there).

First, attach the birth.weights data.frame:

> attach(birth.weights)

Now, calculate the grouped mean:

> gmean<-sum(midpoint*number)/sum(number)
> gmean

Notice we used the "assignment operator" ("<-") to assign the result to a variable named gmean and then displayed the value by typing the name. Did you get the same result as in POB? If not, you probably typed in different midpoint value(s) than in POB. Let's try the grouped variance (using the gmean we just calculated):

> gvar<-sum(((midpoint-gmean)^2)*number)/(sum(number)-1)
> gvar

Your answer may differ slightly from the one in POB, since we carried more significant digits in gmean than the rounded constant 3348.2.
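One way to convince yourself the grouped formulas are right is to expand the table back into individual observations, each set to its class midpoint, and call mean() and var() on the result. The sketch below uses a made-up three-class table (the counts are not the POB values). The names mid, cnt, g.mean, g.var and expanded are hypothetical, chosen so they do not clash with the attached columns.

```r
# Hypothetical grouped data: class midpoints and made-up counts
mid <- c(249.5, 749.5, 1249.5)
cnt <- c(2, 3, 5)

# Grouped mean and grouped variance, same formulas as above
g.mean <- sum(mid * cnt) / sum(cnt)
g.var <- sum(((mid - g.mean)^2) * cnt) / (sum(cnt) - 1)
g.mean     # 899.5

# Check: repeat each midpoint by its count and summarize directly
expanded <- rep(mid, cnt)
mean(expanded)   # should match g.mean
var(expanded)    # should match g.var
```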

For the Energetic Student

For exercise 15, download the Excel file lowbwt.xls or copy it from your CD. Import the Excel file into S-Plus. (See lab 1 if you do not remember how to do this).

By default, S-Plus will read numerical values as double precision. Since the variable sex is a categorical variable, we need to change its data type. Go to the Data menu and select Change Data Type (at the bottom of the list). Select sex for the column to change, and then select Character for the New Data type. Click OK (or Apply to check the result first).

Side-by-side boxplots:

Before calculating anything, let's create side-by-side box plots: Go to the Graph menu and select 2D Graphs > Boxplots (x, grouping optional). Remember that the variable that we are plotting goes on the Y-axis, so select sbp for the y-column. To have separate box plots for each level of the variable sex, select sex for the x-column. Change any options for the box plot by going to the Options tab, then select OK. Add a title and legend, and clean up any axis labels.

Without skipping ahead to the numerical summaries, what is the median systolic blood pressure for males? For females? Based on the plot, do you expect the mean to be close to the median for each group? Are there any outliers?

Numerical Summaries:

To obtain numerical summaries, we will use the Statistics menu. Under the Statistics menu, select Data Summaries, then Summary Statistics. In the dialog box, pick which variables that you want to have summarized, i.e. sbp. To get summaries for males and females separately, specify sex as the Group Variable. For other exercises in the chapter you can leave the Group Variable at the default setting of None.

The output will appear in a "Report" window. You can cut and paste this into your favorite word processor along with the graphs. Can you figure out how to get S-Plus to give you the coefficient of variation for each gender? (This is not obvious.)

Trimmed Means:

To get started, download and import the unicef.xls data.

In exercise 12, you are asked to find the 5% trimmed mean. This is not available under the Statistics menu, but can be obtained using S-Plus's Commands Window as discussed above.

Recall that data in S-Plus are stored as a data.frame. On the command line, there are two ways to refer to a variable within a data.frame. The first is to write unicef$lowbwt (dataframe-name$variable-name), where the $ separates the data.frame from the variable. In practice, this means more typing, and more chances for mistakes! The second is to attach the data.frame as we did previously.
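The $ notation can be tried on a tiny data.frame built by hand. In this sketch the data.frame toy and its columns id and wt are hypothetical, made up for illustration. It is written in the S language and should also run in R.

```r
# Hypothetical data.frame with two columns
toy <- data.frame(id = 1:3, wt = c(2.5, 3.1, 2.8))

toy$wt           # the wt column: 2.5 3.1 2.8
mean(toy$wt)     # works without attaching toy
```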

> attach(unicef)

Now we can refer to variables in the data.frame unicef directly. Type in the variable name lowbwt and hit enter. All the values stored in the variable lowbwt are displayed.

To find the 5% trimmed mean, enter after the prompt

> mean(lowbwt, trim=.05)

What did you get?

Note that in the unicef data.frame, there are many observations with the value NA, meaning Not Available (i.e. missing). In S-Plus, we often have to tell functions what to do with NA's, such as remove them; otherwise the answer will be NA.

To take care of NA's, use

> mean(lowbwt, trim=0.05, na.rm=T)

Now what did you get?

The na.rm=T option means the NA's should be removed. The default is to keep them in. Repeat without the trim option:

> mean(lowbwt, na.rm=T)

This should agree with the mean provided by the Summary Statistics menu, as the default here is trim=0 (no trimming).
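To see what trim and na.rm do, the trimmed mean can be rebuilt by hand on a small made-up vector. The sketch below uses hypothetical names (x, x.ok, k, by.hand) and assumes S-Plus drops floor(n*trim) values from each end after removing NA's, as R does. It uses a 10% trim, because with only 10 non-missing values a 5% trim would drop nothing.

```r
# Hypothetical values with two NA's and one large outlier
x <- c(12, NA, 3, 7, 100, 5, 9, NA, 8, 6, 4, 10)

mean(x)                       # NA, because of the missing values

# Step 1: drop the NA's (what na.rm=T does)
x.ok <- x[!is.na(x)]
n <- length(x.ok)             # 10 values remain

# Step 2: sort, then drop k values from each end
k <- floor(n * 0.1)           # k = 1 for a 10% trim
x.sorted <- sort(x.ok)
by.hand <- mean(x.sorted[(k + 1):(n - k)])
by.hand                       # 7.625

# Compare with the built-in
mean(x, trim=0.1, na.rm=T)
```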