Statistics 101
Data Analysis and Statistical Inference

Instructions for Lab 2



Lab Objective:
To analyze an observational study with methods we’ve covered so far.

Lab Procedures


These days, it is widely understood that mothers who smoke during pregnancy risk exposing their babies to many health problems.  This was not common knowledge forty years ago.  One of the first studies that addressed the issue of pregnancy and smoking was the Child Health and Development Studies, a comprehensive study of all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, CA.  The original reference for the study is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp. 505-518).  The data and a summary of the study are in Nolan and Speed (2000, Stat Labs, Chapter 10) and can be found at the web site http://www.stat.berkeley.edu/users/statlabs.

Open the data set in JMP by clicking here.

There were about 15,000 families in the study, but we will only analyze a subset of the data that contains 1,236 male single births where the baby lived at least 28 days. The researchers interviewed mothers early in their pregnancy to collect information on socioeconomic and demographic characteristics, including an indicator of whether the mother smoked during pregnancy.  This is an observational study, because mothers decided whether or not to smoke during pregnancy; there was no random assignment to smoke or not to smoke.  The variables in the dataset are described in the code book at the end of these instructions.


The Surgeon General's Report (1989) states two assertions about smoking and pregnancy:

(a) Mothers who smoke have increased rates of premature delivery (before 270 days).  
(b) The newborns of mothers who smoke have smaller birth weights at every gestational age (number of days into pregnancy when child is born).

We will analyze the data to see if they support the Surgeon General's assertions. For simplification, we'll compare babies whose mothers smoke to babies whose mothers have never smoked.  The data file you have access to has only these people, although there were other types of smokers in the original dataset.

 

1) In an observational study, the treatment groups should be closely balanced on causally-relevant background variables to mitigate the effect of confounding factors. Identify background characteristics in the data set that might affect gestation length and birth weight. Then, compare the distributions of those background variables for the smokers and non-smokers. Describe succinctly any substantial differences you see between the group of smokers and the group of non-smokers, reporting means and SDs as support for your comparisons. Which variables did you examine that proved to be similar for both groups?

To do this efficiently, go to the Analyze menu option and select Fit-Y-by-X. Enter all the background characteristics into Y and “smoke” into X. You can highlight all the relevant background characteristics simultaneously, and with one click on Y, make them Y variables. Remember that we are comparing the balance of background variables, not outcome variables.

For continuous variables: Examine the box plots, means, and standard deviations by clicking on the red arrow next to “Oneway Analysis” and selecting Quantiles and Means and Std Dev. If you aren’t sure whether the two groups are “similar enough” on a particular variable, a good guideline to follow is this: Compute the difference in the means between the two groups, then divide by the average of their standard deviations. If this quantity is around 0.10 or less in absolute value, the means are pretty close (within 10% of a “combined SD” for the two groups).
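The guideline above can also be checked outside JMP. The following is a minimal Python sketch, not part of the lab's required workflow; the function name standardized_difference and the ages are made up for illustration and are not taken from the lab data.

```python
from statistics import mean, stdev

def standardized_difference(group_a, group_b):
    """Difference in means divided by the average of the two group SDs."""
    diff = mean(group_a) - mean(group_b)
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2
    return diff / avg_sd

# Made-up mothers' ages, for illustration only (NOT the lab data).
smokers_age = [24, 27, 31, 22, 29, 26]
nonsmokers_age = [25, 28, 30, 23, 33, 27]

d = standardized_difference(smokers_age, nonsmokers_age)
print(f"standardized difference = {d:.2f}")

# Guideline from the text: about 0.10 or less (in absolute value) counts as close.
print("means are close" if abs(d) <= 0.10 else "means differ noticeably")
```

The sign of the result only shows which group has the larger mean; the guideline applies to its size.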

For categorical variables: Compare the percentages of people in each category of the variable for smokers and non-smokers. The simplest way to do this is to click on the red arrow next to “Contingency Table” and uncheck everything until only “Row %” remains checked. You can compare percentages in the same way that you compared means above. To calculate the standard deviation for each percentage, use SD = square root [ (p)(1 - p) ], where p is the percentage expressed as a proportion (for example, 0.25 for 25%).
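The same comparison for a categorical variable can be sketched in Python. This is an illustration only: it assumes each percentage is written as a proportion between 0 and 1, and the function names and example proportions are hypothetical, not values from the lab data.

```python
from math import sqrt

def proportion_sd(p):
    """SD for a category proportion p (0 <= p <= 1): sqrt(p * (1 - p))."""
    return sqrt(p * (1 - p))

def standardized_difference_in_proportions(p_a, p_b):
    """Difference in proportions divided by the average of the two SDs."""
    avg_sd = (proportion_sd(p_a) + proportion_sd(p_b)) / 2
    return (p_a - p_b) / avg_sd

# Made-up proportions of mothers in one category, for illustration only.
p_smokers = 0.30
p_nonsmokers = 0.27

d = standardized_difference_in_proportions(p_smokers, p_nonsmokers)
print(f"standardized difference = {d:.2f}")
```

As with the means, a value around 0.10 or less in absolute value suggests the two groups are similar on that category.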

Although these basic checks can be tedious, it’s important not to skip this step when analyzing an observational study or else you are liable to arrive at incorrect conclusions without suspecting anything is wrong. When the background characteristics in the groups differ substantially, you can improve the balance by matching treated and control observations. However, matching won’t work here because the sizes of the two groups are so similar (479 smokers and 539 non-smokers) that any matching scheme would result in almost the entire data set being selected. Regardless of whether you found dramatic differences in the background variables between the smokers and non-smokers, we will proceed as if the data set is valid for checking the assertions of the Surgeon General.

 

2) A premature birth is defined as one that occurs before a gestational age of 270 days (about 9 months). Do the data provide evidence that supports the Surgeon General’s assertion that smoking during pregnancy increases the rate of premature delivery? Write a concise answer, reporting what variables and statistics you used to reach your conclusion.

These percentages are based only on samples of people, so they are subject to chance error.  In Chapter 26 we'll learn how to determine the probability that the difference in the percentages could be explained by chance error.
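For readers who want to see the arithmetic behind a premature-delivery rate, here is a small Python sketch. It uses the 270-day cutoff stated above; the function name premature_rate and the gestation lengths are made up for illustration and say nothing about what the actual data show.

```python
def premature_rate(gestation_days, cutoff=270):
    """Fraction of births with gestation strictly below the cutoff (in days)."""
    flags = [days < cutoff for days in gestation_days]
    return sum(flags) / len(flags)

# Made-up gestation lengths in days, for illustration only (NOT the lab data).
group_1 = [265, 280, 259, 284, 271, 268, 290, 275]
group_2 = [282, 279, 268, 291, 277, 285, 273, 280]

print(f"group 1 premature rate: {premature_rate(group_1):.1%}")
print(f"group 2 premature rate: {premature_rate(group_2):.1%}")
```

A birth at exactly 270 days is not counted as premature, which matches the coding of the premature variable in the code book.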

 

3) The Surgeon General also claims that newborns of mothers who smoke have smaller birth weights at every gestational age (number of days into pregnancy when child is born). Perform a statistical analysis that allows you to answer the question “Do the data support the Surgeon General’s assertion?” Write 2 – 3 sentences explaining your analysis and conclusion, including relevant numerical or graphical evidence. Examine the sensitivity of your conclusion to the effects of outliers, and consider that there may be some gestational ages for which the data don’t provide enough evidence to make claims either way.

This type of problem mirrors real life – no one tells you exactly how to analyze the data. Think about it for a while before asking the TAs for advice.

 

EXTREMELY IMPORTANT!!
When you report the results of an observational study like this one to a journal or to some policy-making body, it is crucial to inform your audience that there may be causally-relevant background characteristics not in the dataset that are not balanced in the two groups.  Whenever possible, you also should suggest examples of such variables (e.g., use of drugs, alcohol, caloric intake, health condition of mother).  This is the ethical thing to do, even if it results in your analyses taking criticism.  Telling the truth about the limitations of a study does more good for society than does hiding or not reporting them, which could lead to bad policy that ultimately hurts people.


Code Book

Variable             Description
id                       id number

birth                   birth date where 1096 = January 1, 1961

gestation             length of gestation in days

bwt/oz                birth weight in ounces (999 = unknown)

parity                 total number of previous pregnancies, including fetal deaths and still births (99 = unknown)

mrace                 mother's race or ethnicity
                             0-5 = white
                                6 = mexican
                                7 = black
                                8 = asian
                                9 = mix
                              10 = unknown

mage                  mother's age in years at termination of pregnancy

med                   mother's education
                               0 =  less than 8th grade
                                1 =  8th to 12th grade. did not graduate high school
                                2 = high school graduate, no other schooling
                                3 = high school graduate + trade school
                                4 = high school graduate + some college
                                5 = college graduate
                                7 = trade school but unclear if graduated from high school
                                9 = unknown

mht                      mother's height in inches

mpregwt             mother's pre-pregnancy weight in pounds

drace                   father's race or ethnicity
                               0-5 = white
                                  6 = mexican
                                  7 = black
                                  8 = asian
                                  9 = mix
                                 10 = unknown

dage                    father's age in years at termination of pregnancy

ded                      father's education 
                                 0 =  less than 8th grade
                                 1 =  8th to 12th grade. did not graduate high school
                                 2 = high school graduate, no other schooling
                                 3 = high school graduate + trade school
                                 4 = high school graduate + some college
                                 5 = college graduate
                                 7 = trade school but unclear if graduated from high school
                                 9 = unknown

dht                      father's height

dwt                     father's weight in pounds

marital               marital status of mother
                              1 = married
                              2 = legally separated
                              3 = divorced
                              4 = widowed
                              5 = never married

inc                     family yearly income in 2,500 increments
                                0 = under 2,500
                                1 = 2,500 – 4,999 …
                                9 = 15,000+
                               98 = unknown
                               99 = not asked

smoke                does mother smoke?
                                0 = has never smoked
                               1 = smokes now
                               2 = until pregnant
                               3 = once did, not now

time                    If mother quit, how long ago did she quit?
                                0 = never smoked
                                1 = still smokes
                                2 = quit during pregnancy
                                3 = up to 1 yr ago
                                4 = up to 2 yr ago
                                5 = up to 3 yr ago
                                6 = up to 4 yr ago
                                7 = 5 to 9 yr ago
                                8 = 10+ yr ago
                                9 = quit and don't know
                               98 = unknown

number                 number of cigarettes smoked a day for past and current smokers
                                 0 = never smoked
                                1 = 1-4
                                2 = 5-9
                                3 = 10-14
                                4 = 15-19
                                5 = 20-29
                                6 = 30-39
                                7 = 40-60
                                8 = 60+
                                9 = smoke but don't know

premature               was the baby born premature?
                                1 = if baby born before gestational age of 270
                                0 = if baby born on or after gestational age of 270
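
Several variables in the code book use a numeric code for "unknown" (for example, 999 for bwt/oz). If such codes are present in your copy of the data, they should be treated as missing before you compute means or SDs. The sketch below shows one way to do that in Python; the UNKNOWN_CODES table and the clean_record function are hypothetical helpers, the table lists only codes the code book labels "unknown", and the sample record is made up.

```python
# Codes the code book labels "unknown" (assumed to be worth treating as missing).
UNKNOWN_CODES = {
    "bwt/oz": {999},
    "parity": {99},
    "mrace": {10},
    "drace": {10},
    "med": {9},
    "ded": {9},
    "inc": {98},
    "time": {98},
}

def clean_record(record):
    """Return a copy of the record with 'unknown' codes replaced by None."""
    return {
        name: (None if value in UNKNOWN_CODES.get(name, ()) else value)
        for name, value in record.items()
    }

# Made-up record, for illustration only (NOT a row from the lab data).
raw = {"id": 1, "gestation": 284, "bwt/oz": 999, "parity": 1, "med": 9}
print(clean_record(raw))
```

Leaving a value such as 999 in place would inflate the mean and SD of birth weight, so it is worth checking for these codes before comparing groups.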