Due: 2018-04-05 at noon
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
Go to the course organization on GitHub: https://github.com/Sta199-S18.
Find the repo whose name starts with lab-09 and ends with your team name (this should be the only lab-09 repo available to you).
In the repo, click on the green Clone or download button, select Use HTTPS (this might already be selected by default, and if it is, you’ll see the text Clone with HTTPS as in the image below). Click on the clipboard icon to copy the repo URL.
Go to RStudio Cloud and into the course workspace. Create a New Project from Git Repo. You will need to click on the down arrow next to the New Project button to see this option.
Copy and paste the URL of your assignment repo into the dialog box:
Hit OK, and you’re good to go!
In this lab we will work with the tidyverse, infer, and openintro packages. We can load them with the following:
library(tidyverse)
library(infer)
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
Configure your Git user name and email. If you cannot remember the instructions, refer to an earlier lab. Also remember that you can cache your password for a limited amount of time.
Update the name of your project to match the lab’s title.
Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.
Before we introduce the data, let’s warm up with some simple exercises.
Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
In this lab we’ll be generating random samples. The last thing you want is those samples to change every time you knit your document. So, you should set a seed. There’s an R chunk in your R Markdown file set aside for this. Locate it and add a seed. Make sure all members in a team are using the same seed so that you don’t get merge conflicts and your results match up for the narratives.
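As a minimal sketch of what the seed chunk might contain (the seed value below is an arbitrary example, not one prescribed by the lab; your team should agree on its own):

```r
# Fix the state of the random number generator so that random samples
# are the same every time the document is knit.
# The value 20180405 is an arbitrary example.
set.seed(20180405)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces exactly the same draw.
set.seed(20180405)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
```

If teammates use different seeds, their simulated results will differ and the knitted output files are likely to conflict when merged.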
Load the ncbirths data from the openintro package:
data(ncbirths)
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.
variable | description |
---|---|
fage | father’s age in years. |
mage | mother’s age in years. |
mature | maturity status of mother. |
weeks | length of pregnancy in weeks. |
premie | whether the birth was classified as premature (premie) or full-term. |
visits | number of hospital visits during pregnancy. |
marital | whether mother is married or not married at birth. |
gained | weight gained by mother during pregnancy in pounds. |
weight | weight of the baby at birth in pounds. |
lowbirthweight | whether baby was classified as low birthweight (low) or not (not low). |
gender | gender of the baby, female or male. |
habit | status of the mother as a nonsmoker or a smoker. |
whitemom | whether mom is white or not white. |
The first step in the analysis of a new dataset is getting acquainted with the data. Make summaries of the variables in your dataset, and determine which variables are categorical and which are numerical. For numerical variables, are there outliers? If you aren’t sure or want to take a closer look at the data, make a graph.
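One possible way to take this first look is sketched below. It assumes the tidyverse and openintro packages are installed; the choice of weight for the histogram and the bin width of 0.5 pounds are only examples.

```r
library(tidyverse)
library(openintro)

data(ncbirths)

# One line per variable: its name, its type, and the first few values.
glimpse(ncbirths)

# Numerical summaries for numerical variables, counts for factor
# variables, and the number of NAs in each column.
summary(ncbirths)

# A histogram of one numerical variable to check for outliers.
# Swap in any other numerical variable to inspect it the same way.
weight_hist <- ggplot(ncbirths, aes(x = weight)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Birth weight (pounds)", y = "Count")
weight_hist
```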
Wen, Shi Wu, Michael S. Kramer, and Robert H. Usher. “Comparison of birth weight distributions between Chinese and Caucasian infants.” American Journal of Epidemiology 141.12 (1995): 1177-1187.
A 1995 study suggests that the average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds). In this dataset we only have information on the mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers are also Caucasian, i.e. whitemom = "white".
We want to evaluate whether the average weight of Caucasian babies has changed since 1995.
Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995: \(H_0: \mu = 7.43~pounds\).
Our alternative hypothesis should reflect the research question, i.e. some change since 1995. Since the research question doesn’t state a direction for the change, we use a two sided alternative hypothesis: \(H_A: \mu \ne 7.43~pounds\).
Create a filtered data frame called ncbirths_white that contains data only from white mothers. Then, calculate the mean of the weights of their babies.
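A minimal sketch of this step, assuming the tidyverse and openintro packages are installed (the column name mean_weight is just an illustrative label):

```r
library(tidyverse)
library(openintro)

data(ncbirths)

# Keep only the rows where the mother is recorded as white.
# filter() also drops rows where whitemom is NA.
ncbirths_white <- ncbirths %>%
  filter(whitemom == "white")

# Mean birth weight (in pounds) among these babies.
white_mean <- ncbirths_white %>%
  summarise(mean_weight = mean(weight))
white_mean
```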
Are the conditions necessary for conducting simulation based inference satisfied? Explain your reasoning.
Let’s discuss how this test would work. Our goal is to simulate a null distribution of sample means that is centered at the null value of 7.43 pounds.
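One common way to build such a null distribution is to bootstrap the sample and shift the bootstrap means so that they are centered at the null value. The sketch below uses the infer package for this; it is an illustration under assumptions rather than the required solution, and the seed and the 1000 repetitions are arbitrary choices.

```r
library(tidyverse)
library(infer)
library(openintro)

data(ncbirths)
set.seed(1234)  # arbitrary example seed

ncbirths_white <- ncbirths %>%
  filter(whitemom == "white")

# Observed sample mean of birth weight.
obs_mean <- ncbirths_white %>%
  summarise(mean_weight = mean(weight)) %>%
  pull(mean_weight)

# Bootstrap resamples, recentered by infer at the null value mu = 7.43,
# with the mean calculated for each resample.
null_dist <- ncbirths_white %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.43) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Two-sided p-value: the proportion of simulated means at least as far
# from 7.43 as the observed mean, in either direction.
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_mean, direction = "two-sided")
p_value
```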
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Make side-by-side boxplots displaying the relationship between habit and weight. What does the plot highlight about the relationship between these two variables?
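A possible starting point for the plot, assuming the tidyverse and openintro packages are installed (the axis labels are illustrative):

```r
library(tidyverse)
library(openintro)

data(ncbirths)

# One box per smoking status, with birth weight on the vertical axis.
# Rows with a missing habit show up as a separate NA box.
habit_boxplot <- ggplot(ncbirths, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Mother's smoking habit", y = "Birth weight (pounds)")
habit_boxplot
```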
Before moving forward, save a version of the dataset omitting observations where there are NAs for habit. You can call this version ncbirths_habitgiven.
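One way to do this is sketched below, assuming the tidyverse and openintro packages are installed:

```r
library(tidyverse)
library(openintro)

data(ncbirths)

# Keep only the rows where habit is recorded (not NA).
ncbirths_habitgiven <- ncbirths %>%
  filter(!is.na(habit))

# Check how many rows were dropped.
nrow(ncbirths) - nrow(ncbirths_habitgiven)
```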
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions. The following code first groups the data by the habit variable and then calculates the mean weight in each group.
ncbirths_habitgiven %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
Are the conditions necessary for conducting simulation based inference satisfied? Explain your reasoning.
Run the appropriate hypothesis test, calculate the p-value, and interpret the results in the context of the data and the hypothesis test.
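A permutation test is one simulation-based approach that fits a difference in two means. The sketch below uses the infer package; it is an example under assumptions, not the only valid approach, and the seed, the 1000 repetitions, and the order of subtraction (nonsmoker minus smoker) are arbitrary choices.

```r
library(tidyverse)
library(infer)
library(openintro)

data(ncbirths)
set.seed(1234)  # arbitrary example seed

ncbirths_habitgiven <- ncbirths %>%
  filter(!is.na(habit))

# Observed difference in mean weight: nonsmoker minus smoker.
obs_diff <- ncbirths_habitgiven %>%
  specify(weight ~ habit) %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Null distribution: shuffle the habit labels so that any association
# with weight is broken, then recompute the difference each time.
null_dist <- ncbirths_habitgiven %>%
  specify(weight ~ habit) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Two-sided p-value for the observed difference.
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_value
```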
Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers.
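A bootstrap percentile interval is one way to construct this interval. The sketch below uses the infer package; the seed, the 1000 repetitions, and the order of subtraction (nonsmoker minus smoker) are assumptions made for illustration.

```r
library(tidyverse)
library(infer)
library(openintro)

data(ncbirths)
set.seed(1234)  # arbitrary example seed

ncbirths_habitgiven <- ncbirths %>%
  filter(!is.na(habit))

# Bootstrap distribution of the difference in mean weights.
boot_dist <- ncbirths_habitgiven %>%
  specify(weight ~ habit) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# The middle 95% of the bootstrap differences.
ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```

If the interval does not contain 0, that is consistent with rejecting the null hypothesis of no difference at the 5% significance level.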
In this portion of the analysis we focus on two variables. The first one is maturemom.
The other variable of interest is lowbirthweight.