Statistics 101
Data Analysis and Statistical Inference
Instructions for Lab 2
Lab Objective: To analyze an observational study with methods we’ve covered so far.
Lab Procedures
These days, it is widely understood that mothers who smoke during pregnancy
risk exposing their babies to many health problems. This was not common
knowledge forty years ago. One of the first studies that addressed the
issue of pregnancy and smoking was the Child Health and Development Studies, a
comprehensive study of all babies born between 1960 and 1967 at the Kaiser
Foundation Hospital in Oakland, CA. The original reference for the study
is Yerushalmy (1964, American Journal of Obstetrics and Gynecology, pp.
505-518). The data and a summary of the study are in Nolan and Speed
(2000, Stat Labs, Chapter 10) and can be found at the web site
http://www.stat.berkeley.edu/users/statlabs.
Open the data set in JMP by clicking here.
There were about 15,000 families in the study, but we will only analyze a
subset of the data that contains 1,236 male single births where the baby lived
at least 28 days. The researchers
interviewed mothers early in their pregnancy to collect information on socioeconomic
and demographic characteristics, including an indicator of whether the
mother smoked during pregnancy. This is an observational study, because
mothers decided whether or not to smoke during pregnancy; there was no random
assignment to smoke or not to smoke. The variables in the dataset are
described in the code book at the end of these instructions.
The Surgeon General's Report (1989) makes two assertions about smoking and
pregnancy:
(a) Mothers who smoke have increased rates of premature delivery (before 270 days).
(b) The newborns of mothers who smoke have smaller birth weights at every
gestational age (number of days into pregnancy when child is born).
We will analyze the data to see if they support the Surgeon General's assertions. For simplicity, we'll compare babies
whose mothers smoke to babies whose mothers have never smoked. The data
file you have access to contains only these two groups, although the original
dataset also included other types of smokers.
1) In an observational study, the treatment groups should be closely
balanced on causally-relevant background variables to mitigate the effect of
confounding factors. Identify background characteristics in the data set that
might affect gestation length and birth weight. Then, compare the distributions
of those background variables for the smokers and non-smokers. Describe
succinctly any substantial differences you see between the group of smokers and
the group of non-smokers, reporting means and SDs as support for your
comparisons. Which of the variables you examined proved to be similar for both
groups?
To do this efficiently, go to the Analyze menu option and select Fit-Y-by-X. Enter all the background characteristics into
Y and “smoke” into X.
You can highlight all the relevant background characteristics
simultaneously, and with one click on Y,
make them Y variables. Remember that we are comparing the balance of
background variables, not outcome variables.
For continuous variables: Examine the box plots, means, and standard
deviations by clicking on the red arrow next to “Oneway Analysis”
and selecting Quantiles and Means and Std Dev. If you aren’t sure whether the two
groups are “similar enough” on a particular variable, a good
guideline to follow is this: Compute the difference in the means between the
two groups, then divide by the average of their standard deviations. If the
absolute value of this quantity is around 0.10 or less, the means are pretty
close (within 10% of a “combined SD” for the two groups).
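The guideline above can also be checked outside JMP. The following is a minimal Python sketch of that calculation, not part of the lab's required workflow. The function name `standardized_difference` and the two lists of ages are hypothetical, invented only to show the arithmetic, and the sample SD (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def standardized_difference(group_a, group_b):
    """Difference in means divided by the average of the two sample SDs."""
    avg_sd = (stdev(group_a) + stdev(group_b)) / 2
    return (mean(group_a) - mean(group_b)) / avg_sd


# Hypothetical mothers' ages in years, invented purely for illustration.
smokers = [24, 27, 31, 22, 29, 26]
nonsmokers = [25, 28, 30, 23, 31, 27]

d = standardized_difference(smokers, nonsmokers)
# Apply the 0.10 guideline to the size of the difference, ignoring its sign.
verdict = "means are close" if abs(d) <= 0.10 else "means may differ substantially"
print(f"standardized difference = {d:.2f} ({verdict})")
```

The sign of the result only indicates which group has the larger mean, so the guideline is applied to the absolute value.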
For categorical variables: Compare the percentages of people in each
category of the variable for smokers and non-smokers. The simplest way to do this is to click on
the red arrow next to “Contingency Table” and uncheck everything
until only “Row %” remains checked.
You can compare percentages in the same way that you compared means
above. To calculate the standard deviation for each percentage, use
SD = square root [ p(1 - p) ], where p is the percentage expressed as a
proportion (for example, 25% is p = 0.25).
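As a rough illustration of the percentage comparison described above, here is a short Python sketch. The function names and the two example proportions are hypothetical and not taken from the data set. The percentages are assumed to be entered as proportions between 0 and 1.

```python
from math import sqrt


def proportion_sd(p):
    """SD of a yes/no variable whose 'yes' proportion is p."""
    return sqrt(p * (1 - p))


def standardized_difference_in_proportions(p_a, p_b):
    """Difference in proportions divided by the average of the two SDs.

    Undefined (division by zero) if both proportions are exactly 0 or 1.
    """
    avg_sd = (proportion_sd(p_a) + proportion_sd(p_b)) / 2
    return (p_a - p_b) / avg_sd


# Hypothetical row percentages for one category, invented for illustration.
p_smokers = 0.30
p_nonsmokers = 0.25

d = standardized_difference_in_proportions(p_smokers, p_nonsmokers)
print(f"standardized difference = {d:.2f}")
```

The same 0.10 guideline used for means can then be applied to the absolute value of the result.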
Although these basic checks can be tedious, it’s important not to
skip this step when analyzing an observational study or else you are liable to
arrive at incorrect conclusions without suspecting anything is wrong. When the background characteristics in the
groups differ substantially, you can improve the balance by matching treated
and control observations. However,
matching won’t work here because the sizes of the two groups are so
similar (479 smokers and 539 non-smokers) that any matching scheme would result
in almost the entire data set being selected.
Regardless of whether you found dramatic differences in the background
variables between the smokers and non-smokers, we will proceed as if the data set
is valid for checking the assertions of the Surgeon General.
2) A premature birth is defined as one that occurs before a gestational age
of 270 days (about 9 months). Do the
data provide evidence that supports the Surgeon General’s assertion that
smoking during pregnancy increases the rate of premature delivery? Write a concise answer, reporting what
variables and statistics you used to reach your conclusion.
These percentages are based only on samples of people, so they are subject
to chance error. In Chapter 26 we'll learn how to determine the
probability that the difference in the percentages could be explained by chance
error.
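For readers who want to see the arithmetic behind a comparison of premature-birth percentages, the sketch below computes the rate within each smoking group. It assumes each record carries the `smoke` and `premature` codes from the code book (1 = smokes now, 0 = has never smoked; 1 = premature, 0 = not). The function name and the ten rows are made up for illustration and say nothing about the real data.

```python
def premature_rate(records, smoke_code):
    """Share of births coded premature == 1 among mothers with the given smoke code."""
    group = [r for r in records if r["smoke"] == smoke_code]
    return sum(r["premature"] for r in group) / len(group)


# Hypothetical records, invented for illustration only.
records = [
    {"smoke": 1, "premature": 1},
    {"smoke": 1, "premature": 0},
    {"smoke": 1, "premature": 1},
    {"smoke": 1, "premature": 0},
    {"smoke": 1, "premature": 0},
    {"smoke": 0, "premature": 0},
    {"smoke": 0, "premature": 1},
    {"smoke": 0, "premature": 0},
    {"smoke": 0, "premature": 0},
    {"smoke": 0, "premature": 0},
]

rate_smokers = premature_rate(records, 1)
rate_never = premature_rate(records, 0)
print(f"smokers: {rate_smokers:.0%}, never smoked: {rate_never:.0%}")
```

A difference between the two rates computed this way is still subject to chance error, as noted above.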
3) The Surgeon General also claims that
newborns of mothers who smoke have smaller birth weights at every gestational
age (number of days into pregnancy when child is born). Perform a statistical analysis that allows
you to answer the question “Do the data support the Surgeon General’s
assertion?” Write 2 – 3 sentences
explaining your analysis and conclusion, including relevant numerical or
graphical evidence. Examine the
sensitivity of your conclusion to the effects of outliers, and consider that
there may be some gestational ages for which the data don’t provide
enough evidence to make claims either way.
This type of problem mirrors real life – no one tells you exactly how
to analyze the data. Think about it for
a while before asking the TAs for advice.
EXTREMELY IMPORTANT!!
When you report the results of an observational study like this one to a journal
or to some policy-making body, it is crucial to inform your audience that there
may be causally-relevant background characteristics not in the dataset that are
not balanced in the two groups. Whenever possible, you also should
suggest examples of such variables (e.g., use of drugs, alcohol, caloric
intake, health condition of mother). This is the ethical thing to do,
even if it results in your analyses taking criticism. Telling the truth
about the limitations of a study does more good for society than does hiding or
not reporting them, which could lead to bad policy that ultimately hurts
people.
Code Book
Variable Description
id
id number
birth
birth date where 1096 = January 1, 1961
gestation
length of gestation in days
bwt/oz
birth weight in ounces (999 = unknown)
parity
total number of previous pregnancies, including fetal deaths and still births (99 = unknown)
mrace
mother's race or ethnicity
0-5 = white
6 = mexican
7 = black
8 = asian
9 = mix
10 = unknown
mage
mother's age in years at termination of pregnancy
med
mother's education
0 = less than 8th grade
1 = 8th to 12th grade, did not graduate high school
2 = high school graduate, no other schooling
3 = high school graduate + trade school
4 = high school graduate + some college
5 = college graduate
7 = trade school but unclear if graduated from high school
9 = unknown
mht
mother's height in inches
mpregwt
mother's pre-pregnancy weight in pounds
drace
father's race or ethnicity
0-5 = white
6 = mexican
7 = black
8 = asian
9 = mix
10 = unknown
dage
father's age in years at termination of pregnancy
ded
father's education
0 = less than 8th grade
1 = 8th to 12th grade, did not graduate high school
2 = high school graduate, no other schooling
3 = high school graduate + trade school
4 = high school graduate + some college
5 = college graduate
7 = trade school but unclear if graduated from high school
9 = unknown
dht
father's height
dwt
father's pre-pregnancy weight in pounds
marital
marital status of mother
1 = married
2 = legally separated
3 = divorced
4 = widowed
5 = never married
inc
family yearly income in 2,500 increments
0 = under 2,500
1 = 2,500 – 4,999 …
9 = 15,000+
98 = unknown
99 = not asked
smoke
does mother smoke?
0 = has never smoked
1 = smokes now
2 = until pregnant
3 = once did, not now
time
If mother quit, how long ago did she quit?
0 = never smoked
1 = still smokes
2 = quit during pregnancy
3 = up to 1 yr ago
4 = up to 2 yr ago
5 = up to 3 yr ago
6 = up to 4 yr ago
7 = 5 to 9 yr ago
8 = 10+ yr ago
9 = quit and don't know
98 = unknown
number
number of cigarettes smoked a day for past and current smokers
0 = never smoked
1 = 1-4
2 = 5-9
3 = 10-14
4 = 15-19
5 = 20-29
6 = 30-39
7 = 40-60
8 = 60+
9 = smoke but don't know
premature
was the baby born premature?
1 = if baby born before gestational age of 270
0 = if baby born on or after gestational age of 270
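Several variables in the code book use a numeric code for "unknown" (for example, 999 for bwt/oz and 99 for parity). If those codes are left in place, they will distort means and SDs. The Python sketch below shows one possible way to blank them out before summarizing, and how the birth code could be turned into a calendar date. The names `UNKNOWN_CODES`, `clean_value`, and `birth_code_to_date` are hypothetical. Treating "not asked" (inc = 99) as missing, and assuming birth codes count consecutive days, are assumptions rather than facts stated in the code book.

```python
from datetime import date, timedelta

# "Unknown" codes copied from the code book for a few variables; extend as needed.
# Assumption: "not asked" (inc = 99) is treated as missing as well.
UNKNOWN_CODES = {
    "bwt/oz": {999},
    "parity": {99},
    "med": {9},
    "ded": {9},
    "inc": {98, 99},
}


def clean_value(variable, value):
    """Return None when value is an 'unknown' code for this variable, else the value."""
    return None if value in UNKNOWN_CODES.get(variable, set()) else value


def birth_code_to_date(code):
    """Convert a birth code to a date, anchored at 1096 = January 1, 1961.

    Assumes codes are consecutive day counts.
    """
    return date(1961, 1, 1) + timedelta(days=code - 1096)


print(clean_value("bwt/oz", 999))   # unknown birth weight becomes None
print(clean_value("bwt/oz", 120))   # a recorded weight is left unchanged
print(birth_code_to_date(1096))     # the anchor date given in the code book
```

Values returned as None can then be dropped before computing means and SDs for a group.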