Lab 2 Objectives:

Here we will learn how to use S-Plus to compute numerical summaries of our data. In particular, we will reproduce the results in Chapter 3, Section 3.5, Further Applications, of the POB text. The energetic student can also analyze two data sets from the Chapter 3 exercises. If you forget how to import data into S-Plus, or how to make or open a workspace, review lab 1 to refresh your memory.

Download the heart rate data shown in Table 3.5 of Section 3.5 of POB (or copy it from your CD). Import it into S-Plus (see lab 1). We will assume the name of the data set is heart.rates.

We'll use the Commands Window to do some calculations. On the pull-down menus choose Window>Commands Window so that there is a check mark next to Commands Window.
Alternatively, press the Commands Window toolbar button.

Let's do the same thing using the median:

Now let's calculate some measures of dispersion. First, try the range and interquartile range (IQR).

Quick, calculate the variance and standard deviation!

Now, download the birth weights.xls file and import it into S-Plus. (You may already have the data in S-Plus from lab 1.) We'll calculate the grouped mean and grouped variance as on pages 57 and 58 of POB. We'll assume the name of the data set is birth.weights.

First, we need to add a column to the data set. From the pull-down menus choose Insert>Column.... Type midpoint into the Name(s) field of the Insert Columns dialog box. Leave everything else as is and click OK. Enter the midpoints of the classes into the midpoint column as in POB page 57: 249.5, 749.5, ..., 5249.5.

Now go to the Commands Window (see above on how to get there). First, attach the birth.weights data.frame:

> attach(birth.weights)

For the Energetic Student

For exercise 15, download the Excel file lowbwt.xls or copy it from your CD. Import the Excel file into S-Plus. (See lab 1 if you do not remember how to do this.) By default, S-Plus will read numerical values as double precision. Since the variable sex is a categorical variable, we need to change its data type. Go to the Data menu and select Change Data Type (at the bottom of the list). Select sex for the column to change, and then select Character for the New Data type. Click OK (or Apply to check the result first).

Side-by-side boxplots:

Before calculating anything, let's create side-by-side boxplots. Go to the Graph menu and select 2D Graphs > Boxplots (x, grouping optional). Remember that the variable that we are plotting goes on the Y-axis, so select sbp for the y-Column. To have separate boxplots for each level of the variable sex, select sex for the x-Column. Change any options for the boxplot by going to the Options tab, then select OK. Add a title and legend, and clean up any axis labels.

Without skipping ahead to the numerical summaries, what is the median systolic blood pressure for males? For females?
Based on the plot, do you expect the mean to be close to the median for each group? Are there any outliers?

Numerical Summaries:

To obtain numerical summaries, we will use the Statistics menu. Under the Statistics menu, select Data Summaries, then Summary Statistics. In the dialog box, pick the variables that you want summarized (here, sbp). To get summaries for males and females separately, specify sex as the Group Variable. For other exercises in the chapter you can leave the Group Variable at the default setting of None.

The output will appear in a "Report" window. You can cut and paste this into your favorite word processor along with the graphs.

Can you figure out how to get S-Plus to give you the coefficient of variation for each gender? (This is not obvious.)

Trimmed Means:

To get started, download and import the unicef.xls data. In exercise 12, you are asked to find the 5% trimmed mean. This is not available under the Statistics menu, but can be obtained using S-Plus's Commands Window as discussed above.

Recall that data in S-Plus are stored as a data.frame. On the command line, if we want to refer to a variable within a data.frame, we have two options. The first is to specify the data.frame and the variable together, as in unicef$lowbwt (dataframe-name$variable-name), where the $ separates the data.frame from the variable. In practice, this means more typing, and more chances for mistakes! The second approach is to attach the data.frame as we did previously.

> attach(unicef)

Now we can refer to variables in the data.frame unicef directly. Type the variable name lowbwt and hit Enter. All the values stored in the variable lowbwt are displayed. To find the 5% trimmed mean, enter after the prompt

> mean(lowbwt, trim=.05)

What did you get? Note that in the unicef data.frame, there are many observations with a value NA, meaning Not Available. In S-Plus, we often have to tell functions what to do with NA's, such as remove them; otherwise the answer will be NA.
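To see why missing values matter, here is a small sketch on a made-up vector (x and its values are hypothetical, not the unicef data). It assumes only basic functions that should behave the same in S-Plus and R: is.na() flags the missing entries, and mean() with its default settings returns NA as soon as one is present.

```r
# Hypothetical toy vector (NOT the unicef data): two of six values are missing.
x <- c(7, 5, NA, 12, 9, NA)

n.missing <- sum(is.na(x))   # count the NA entries
plain.mean <- mean(x)        # NA, because missing values propagate by default

print(n.missing)
print(plain.mean)
```

Counting the NA's first is a quick way to see how much of a variable is actually available before you summarize it.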
To take care of NA's, use

> mean(lowbwt, trim=0.05, na.rm=T)

Now what did you get? The na.rm=T option means the NA's should be removed. The default is to keep them in. Repeat without the trim option:

> mean(lowbwt, na.rm=T)

This should agree with the mean provided by the Summary Statistics menu, as the default here is trim=0 (no trimming).
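If you are curious what trimming actually does, the sketch below computes a 5% trimmed mean by hand and compares it with the built-in result. The vector x is made up for illustration (it is not the unicef data), and the by-hand version assumes the usual rule, documented for R's mean() and expected to match S-Plus, that floor(n * trim) observations are dropped from each end of the sorted data.

```r
# Hypothetical toy data (NOT the unicef values): 20 observations plus one NA.
x <- c(3, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 13, 14, 15, 16, 40, NA)

# Built-in: remove the NA, then trim 5% from each end.
builtin <- mean(x, trim = 0.05, na.rm = T)

# By hand, assuming floor(n * trim) values are dropped from each tail.
y <- sort(x[!is.na(x)])               # remove the NA and order the data
n <- length(y)                        # number of non-missing observations
k <- floor(n * 0.05)                  # number of values cut from each end
by.hand <- mean(y[(k + 1):(n - k)])   # ordinary mean of what is left

print(c(builtin, by.hand))
```

With 20 non-missing values, 5% trimming removes one value from each tail, so the single large observation (40) no longer pulls the mean upward.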