STA 102 Introductory Biostatistics - Computer Laboratory


Exercises: 2.29-2.30, page 42 Rosner

Data File: lead.xls (EXCEL format)

Documentation File: LEAD.DOC (MS Word format)


TO HAND IN:

Type up a one-paragraph summary that addresses all of the points in exercises 2.29-2.30. All computer output that is included as part of the summary should be clearly labeled, and any extraneous material should be removed using an editor.


Topics Covered

  1. Starting S-Plus
  2. Creating a Workspace
  3. Importing Data into S-Plus
  4. Summary Statistics
  5. Graphs

The following instructions are based on the assumption that you are using S-Plus in one of the computer labs on campus. If you have any questions about how to do the assignment at home, do not hesitate to ask your TA or post a question on the Course Info Discussion Board!

Preliminaries:

For this exercise we will refer to the lead data on the computer disk from the book. All data sets are available from the web site in the Rosnerdata folder. Files ending in .DOC are Microsoft Word files with the format and documentation of the data set. Files ending in .XLS are stored as Excel files, while the raw data are available in files ending in .DAT. In this series of exercises, we will cover how to import Excel files into S-Plus, so you will first need to save the file lead.xls to your computer. The descriptions of the variables are in LEAD.DOC. Note: you may want to save these files to your own floppy disk so that you can continue working on another computer later if need be. Remember where you saved them!

Starting S-Plus:

Start up S-Plus using the Start menu. The program may be listed under Programs > Statistics & Mathematics > S-PLUS 2000. If you have questions, please check with the TA.

Creating your Personal Workspace:

You can create your own workspace directory for saving your class work; this will help if you need to move files to different computers. See Chapter 7 of the Online Guide to Statistics (under the Help Menu) or follow the instructions here for creating a workspace.

Importing Data:

The next thing we need to do is read in the data. S-Plus stores data in objects called "dataframes". A dataframe is like a table of numbers, with columns corresponding to variables and rows corresponding to observations. Data in dataframes can be continuous or discrete (numeric) or categorical.

When you start S-Plus, a "Select Data" window may be open. Select "Import data", and click "OK". If the window was not opened at startup (or you need to create a new dataframe at a later point in the session), go to the File menu and select Import Data > From file. In the "Import Data" window, select "Microsoft Excel Files (*.xl*)" in the Files of Type scrolling menu. Enter or browse to select the file lead.xls that you previously downloaded. The default name for the dataframe will be lead; you may change it if you prefer (just be sure to use your name in place of lead in any commands below). Click on Open (or hit Enter). You should see a "spreadsheet" that represents the dataframe lead, with 38 columns and 124 rows (observations). (Note: the Excel file has had redundant ID and record variables removed, and the ID and record variables combined to make one identifier, so the Excel file has fewer columns than described in the Word file.)

Creating a Subset

The variables that we are interested in are Age, Sex, Iqv (IQ verbal), Iqp (IQ performance), and Lead.type (the lead grouping variable). The control group corresponds to children with Lead.type = 1; the exposed group is defined as children with Lead.type = 2; we will ignore all cases with Lead.type = 3. To make life easier, we will create a new dataframe using the subset of data for only the exposed and control groups. To create a subset, go to the Data menu and select Subset.... Rather than keeping all 38 columns, select columns Age, Sex, Iqv, Iqp, and Lead.type for the Columns in Subset field by clicking on the first name and then using a Ctrl-Click (press the Ctrl key while simultaneously clicking with the mouse on the name) to select additional columns. To specify which rows to keep, we can use a logical expression; we want all values of Lead.type except 3, so enter

Lead.type != 3

in the Subset field. The "!=" means Not Equal to. Enter a name for the dataframe in the Save In field, e.g. leadhw1, then click OK. Avoid names with hyphens or other symbols.
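For readers who want to see the logic of the Subset dialog written out, here is a minimal sketch in Python rather than S-Plus. The rows are hypothetical toy values (not taken from lead.xls), and the helper name subset is an assumption made for illustration only.

```python
# Hypothetical toy rows; these are NOT values from lead.xls.
rows = [
    {"Id": 1, "Age": 509, "Sex": 1, "Iqv": 90, "Iqp": 95, "Lead.type": 1},
    {"Id": 2, "Age": 1102, "Sex": 2, "Iqv": 85, "Iqp": 80, "Lead.type": 2},
    {"Id": 3, "Age": 700, "Sex": 1, "Iqv": 100, "Iqp": 98, "Lead.type": 3},
]

# The five columns named in the handout.
keep_columns = ["Age", "Sex", "Iqv", "Iqp", "Lead.type"]


def subset(rows, columns, keep_row):
    """Keep only the wanted columns, and only rows where keep_row(row) is true."""
    return [{c: r[c] for c in columns} for r in rows if keep_row(r)]


# Same condition as the Subset field: Lead.type != 3
leadhw1 = subset(rows, keep_columns, lambda r: r["Lead.type"] != 3)
print(len(leadhw1), "rows kept")
```

The row filter and the column selection are independent choices, which is why the dialog asks for them in separate fields.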

Recoding

To make summaries and output more meaningful we can recode the data as follows. Go to the Data menu and select Recode... Make sure that your new subset is selected in the Data Set field. Select Sex as the column to recode. For Current value, select 1. For the New value enter "Male". The quotes are necessary! Press Apply to update the column. Now repeat for females: select the Current value of 2, and enter "Female" for the New value. Press OK. All of the 1's and 2's should now be recoded as Male and Female. We need to inform S-Plus that the variable Sex is now a categorical variable or factor, so go to the Data menu and select Change Data Type. Select Sex in the Column field, and then select factor for the New Type. Click on OK.

Repeat these steps for the Lead.type variable so that the 1's and 2's are recoded as Control and Exposed, respectively, and the Data Type is a factor.
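The recoding above amounts to a lookup from numeric codes to labels. A minimal Python sketch of that idea follows; the names SEX_LABELS, LEAD_LABELS and recode are hypothetical, and the input codes are toy values, not the real data.

```python
# Code-to-label mappings as described in the handout.
SEX_LABELS = {1: "Male", 2: "Female"}
LEAD_LABELS = {1: "Control", 2: "Exposed"}


def recode(values, labels):
    """Replace each numeric code with its label; an unmapped code raises KeyError."""
    return [labels[v] for v in values]


# Toy codes for illustration only.
print(recode([1, 2, 2], SEX_LABELS))
print(recode([2, 1, 1], LEAD_LABELS))
```

Raising an error on an unexpected code is a deliberate choice in this sketch: it would flag, for example, a Lead.type of 3 that slipped past the subset step.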

The last transformation we need to do is recode Age. The values are recorded as YearMonth, so 509 is a child who is 5 years 9 months old. We can convert this to either years or months, rather than the mix, by applying a clever transformation. Let's recode Age into years by applying the following formula:

floor(Age/100) + (Age - 100*floor(Age/100))/12

The function floor(Age/100) takes 509/100 = 5.09 and rounds it down to 5 (the number of years only). The second part extracts the number of months and converts it to fractional years. Adding the two parts gives us the Age in Years. To apply this, go to the Data menu and select Transform... Type Years in the Target Column, so that we create a new column. Now just cut and paste the above expression, or type it, into the Expression field. Click on Apply to see the results. Verify that they are correct, then click on OK.
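To check the transformation by hand, the same arithmetic can be sketched in Python. The function name age_in_years is hypothetical; the formula is the one given above, and 509 is the example from the text.

```python
import math


def age_in_years(age_code):
    """Convert a YearMonth code such as 509 (5 years 9 months) to fractional years."""
    years = math.floor(age_code / 100)   # 509 -> 5
    months = age_code - 100 * years      # 509 -> 9
    return years + months / 12           # 5 + 9/12


print(age_in_years(509))  # 5.75
```

A quick sanity check for the new Years column: every value should lie between floor(Age/100) and floor(Age/100) + 1.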

Now we are ready to analyze the data!

Descriptive Summaries:

For numerical data (continuous or discrete), such as Iqv, numerical summaries such as means, standard deviations and quantiles can be obtained from the Statistics menu. For factors such as Sex we would like counts or proportions. To obtain these summaries, select Statistics, then Data Summaries, then Summary Statistics... In the popup window, you should see that leadhw1 is selected as the Data Set. If not, select or enter the dataframe name. To obtain summary statistics for verbal IQ only, scroll down until Iqv is highlighted and click to select it in the Variable scrolling menu. To get summaries for multiple variables, use a Ctrl-Click to highlight additional variables in the Variable menu. In this case just go ahead and select all. By default, summaries are based on all cases, but in this situation we would like to have summaries broken down by the two lead level groups. To provide group summaries, scroll down to highlight Lead.type in the Group Variable menu. Click on OK. The summaries should appear in the Report window. You may copy/paste them to a Word document or your favorite editor.
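As an illustration of what a grouped summary computes, here is a minimal Python sketch using toy numbers (not the lead data). The function name summarize_by_group is hypothetical, and only the count, mean and sample standard deviation are shown; the S-Plus report may include other statistics.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_group(values, groups):
    """Return {group: {"n", "mean", "sd"}} for a numeric column split by a factor."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    # stdev is the sample standard deviation and needs at least 2 values per group.
    return {
        g: {"n": len(v), "mean": mean(v), "sd": stdev(v)}
        for g, v in by_group.items()
    }


# Toy verbal IQ scores and group labels, for illustration only.
iqv = [80, 90, 100, 70, 90]
lead_type = ["Control", "Control", "Control", "Exposed", "Exposed"]

summary = summarize_by_group(iqv, lead_type)
for group, stats in summary.items():
    print(group, stats)
```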

For categorical variables such as Sex and Lead.type, cross-classification provides a breakdown for comparing how many females and males are in each Lead group. Go to the Statistics menu, and select Data Summaries, then Crosstabulation.... Click on Sex, and then scroll down and use a Ctrl-Click to select Lead.type. Click on OK. The summary will appear in the Report window. In each "cell" of the cross-classification of the 2 variables (total of 6 cells) the numbers correspond to

Do the proportions indicate that there is a difference in the exposed and control groups based on gender? Which set of counts or proportions should be used for comparing the groups?
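To make the different kinds of proportions concrete, the sketch below builds a small cross-classification in Python from toy labels (not the lead data). It assumes the common convention of reporting, for each cell, the count plus the proportion of its row, of its column, and of the overall total; the exact layout of the S-Plus report may differ. The function name crosstab is hypothetical.

```python
from collections import Counter


def crosstab(row_values, col_values):
    """Return {(row, col): counts and proportions} for two categorical columns."""
    counts = Counter(zip(row_values, col_values))
    row_totals = Counter(row_values)
    col_totals = Counter(col_values)
    total = len(row_values)
    return {
        (r, c): {
            "n": n,
            "row_prop": n / row_totals[r],   # share of this row's cases
            "col_prop": n / col_totals[c],   # share of this column's cases
            "total_prop": n / total,         # share of all cases
        }
        for (r, c), n in counts.items()
    }


# Toy labels, for illustration only.
sex = ["Male", "Male", "Female", "Female", "Male"]
lead_type = ["Control", "Exposed", "Control", "Control", "Control"]

table = crosstab(sex, lead_type)
for cell, stats in table.items():
    print(cell, stats)
```

Note that the row and column proportions use different denominators, so they answer different questions about the same cell.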

Graphical Summaries:

Side-by-side Boxplots: For comparing numerical data using 2 or more groups, side-by-side boxplots are often more useful than histograms. To create boxplots for IQ verbal go to the Graph menu and select 2D Plots, then select Box Plot (x, grouping optional). Click OK. In the popup window, select Lead.type for the x variable and Iqv for the y variable. Click on OK. The plot should appear in a Graph Window. Repeat for other numerical data. To paste the graph into a word processor such as Word, make sure the Graph Sheet is the active Window, then click on the icon of the Clipboard with a graph (the third row down from the top of the S-Plus window, far right; if you hold the mouse over it you should see "Send Graph to Other App"). A dialog box should appear indicating the graph has been sent to the clipboard. Go to your word processor and click on the clipboard to paste the graph into your document. Resize as needed.
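A boxplot is essentially a picture of a five-number summary (minimum, lower quartile, median, upper quartile, maximum). The Python sketch below computes one for a toy sample. It uses the default quartile method of Python's statistics.quantiles; quartile conventions differ between packages, so S-Plus may draw the box edges at slightly different values. The function name five_number_summary is hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using Python's default quartile method."""
    q1, median, q3 = quantiles(values, n=4)
    return (min(values), q1, median, q3, max(values))


# Toy sample, for illustration only.
print(five_number_summary([1, 2, 3, 4, 5, 6, 7]))  # (1, 2.0, 4.0, 6.0, 7)
```

Computing this summary separately for each lead group gives the numbers behind each box in the side-by-side plot.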

Histograms: Histograms can be obtained by selecting Graph > 2D Plots > Histogram. Select Iqv for the X Columns. To produce separate histograms for each group, you will need to use the subset option. For the control group enter in the Subset field

Lead.type == "Control"

This uses only the Control cases. Similarly, create a histogram of verbal IQ for the exposed group. Which is more useful in this example, the boxplot or histogram?
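The subset-then-bin idea behind a per-group histogram can be sketched in Python as follows. The scores and labels are toy values, the bin width of 10 is an arbitrary assumption, and histogram_counts is a hypothetical helper; S-Plus chooses its own bins.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Toy verbal IQ scores and group labels, for illustration only.
iqv = [72, 85, 88, 91, 104, 79]
lead_type = ["Control", "Control", "Exposed", "Control", "Exposed", "Control"]

# Same idea as entering Lead.type == "Control" in the Subset field.
control_iqv = [v for v, g in zip(iqv, lead_type) if g == "Control"]
print(histogram_counts(control_iqv, 10))  # {70: 2, 80: 1, 90: 1}
```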

Bar Plots: Bar graphs are not as easy to obtain from the Graph menu, unless one has already created a dataset with the counts (heights of the bars). We can create a new dataframe with the summary counts for the 4 categories: Male-Control, Male-Exposed, Female-Control and Female-Exposed. To do this, go to the File menu and select New > Data Set. In the first column enter the 4 categories, and then in the second column enter the cell counts that you found earlier. Double-clicking on a column name will allow you to rename the variable. Rename column 1 Group, and column 2 Total. To create a bar graph, go to the Graph menu and select 2D Plots and then Bar - Base at Zero. Click OK. Select Total for the Y-Column, and Group for the X-Column. Click OK.
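The two-column Group/Total table described above can be sketched in Python as below, together with a crude text rendering of the bars. The counts are hypothetical placeholders, not the real cell counts from the lead data, and bar_rows and text_bar_chart are made-up helper names.

```python
def bar_rows(cell_counts):
    """Turn {(sex, lead_group): count} into (Group, Total) rows."""
    return [(f"{sex}-{lead}", n) for (sex, lead), n in cell_counts.items()]


def text_bar_chart(rows):
    """Render one line per bar, starting at zero, with one '#' per case."""
    width = max(len(group) for group, _ in rows)
    return [f"{group.ljust(width)} | {'#' * total} {total}" for group, total in rows]


# Hypothetical counts; substitute the cell counts from your own cross-tabulation.
cell_counts = {
    ("Male", "Control"): 4,
    ("Male", "Exposed"): 2,
    ("Female", "Control"): 3,
    ("Female", "Exposed"): 1,
}

rows = bar_rows(cell_counts)
for line in text_bar_chart(rows):
    print(line)
```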

Will the Bar Plot change if you used proportions instead of totals? What about using the row or column proportions from the cross-tabulation; would these be meaningful?

