In the following exercises we will learn to visualize and summarize relationships between two numerical variables as well as a numerical and a categorical variable. We will cover evaluation of relationships between two categorical variables next week.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. The data can be found in the oilabs package, the companion package for OpenIntro labs.

Let’s load the packages.

library(dplyr)
library(ggplot2)
library(oilabs)

Creating a reproducible report

We will be using a markdown language, R Markdown, to type up the report. This allows you to complete your data analysis entirely in RStudio and ensures the reproducibility of your analysis and results. To help get you started we are providing a template for you. Use the following code to download this template:

download.file("http://stat.duke.edu/~mc301/ARTSCI101_Su16/post/r/nc_births_template.Rmd", destfile = "nc_births.Rmd")

You will see a new file called nc_births.Rmd in the Files tab in the pane in the bottom right corner of your RStudio window. We will refer to this as your “R Markdown file” or “your report”. Click on the file name to open the file. All you need to do to complete the lab is to type up your brief answers and the R code (when necessary) in the spaces and code chunks provided in the document.

Before you keep going, type your team name, and then click on Knit HTML. You’ll see your compiled document in a new pop-up window.

A note on workspaces: Before we get started with the lab, let’s take a moment to review our R Markdown workflow and remind ourselves about workspaces in R. The workspace of the Console and the workspace of your R Markdown document are not the same. Therefore, if you define a variable only in the Console and then try to use that variable in your R Markdown document, you’ll get an error. This might seem frustrating at first, but it is actually a feature that helps you in the long run. In order to ensure that your report is fully reproducible, everything that is used in the report must be defined in the report, and not somewhere else.

It is your responsibility, and an important learning goal of this course, to master the skills for creating fully reproducible data analysis reports. Below are some tips for achieving this goal:

  • Always work in your R Markdown document, and not in the Console.
  • Knit early, and often, always checking that the resulting document contains everything you expected it to contain.

The data

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Load the nc data set into our workspace.

data(nc)

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.

variable        description
fage            father’s age in years.
mage            mother’s age in years.
mature          maturity status of mother.
weeks           length of pregnancy in weeks.
premie          whether the birth was classified as premature (premie) or full-term.
visits          number of hospital visits during pregnancy.
marital         whether mother is married or not married at birth.
gained          weight gained by mother during pregnancy in pounds.
weight          weight of the baby at birth in pounds.
lowbirthweight  whether baby was classified as low birthweight (low) or not (not low).
gender          gender of the baby, female or male.
habit           status of the mother as a nonsmoker or a smoker.
whitemom        whether mom is white or not white.

  1. What are the cases in this data set? How many cases are there in our sample?

You can answer this question by viewing the data in the data viewer or by using the following command:

str(nc)
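
As a complement to str(), base R's nrow(), ncol(), and names() report the size and the column names of a data frame directly. The sketch below runs them on toy_births, a small made-up data frame used only as a stand-in; its values are hypothetical and are not taken from nc.

```r
# Hypothetical stand-in for nc: three made-up rows, not real data
toy_births <- data.frame(
  mage   = c(19, 27, 34),
  weight = c(6.8, 7.4, 8.1),
  habit  = factor(c("smoker", "nonsmoker", "nonsmoker"))
)

str(toy_births)                 # compact overview, one line per variable

n_cases   <- nrow(toy_births)   # number of rows (cases)
n_vars    <- ncol(toy_births)   # number of columns (variables)
var_names <- names(toy_births)  # column names
```

Replacing toy_births with nc gives the corresponding counts for the lab data.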

As you review the variables, consider which variables are categorical and which are numerical.

Types of variables: Numerical variables can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. There are two types of numerical variables: continuous (measured on a continuous scale) and discrete (can only take on whole number values, often counted rather than measured).

Categorical variables can only take on a finite set of values (levels). There are two types of categorical variables: those whose levels follow a logical order (these are said to be ordinal) and those whose levels do not follow an order.
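
How a column is stored in R gives a first hint about its type, although the final call still depends on what the variable means. The sketch below uses a made-up data frame, toy_births; the size column is hypothetical and does not exist in nc. It checks each column with is.numeric() and is.ordered().

```r
# Hypothetical data frame; the values and the size column are made up
toy_births <- data.frame(
  weeks  = c(39L, 40L, 36L),   # integer: whole-number values
  weight = c(7.1, 8.0, 5.4),   # double: measured on a continuous scale
  habit  = factor(c("nonsmoker", "smoker", "nonsmoker")),  # unordered factor
  size   = factor(c("medium", "large", "small"),
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)                          # ordered factor
)

# One TRUE/FALSE per column
is_num <- sapply(toy_births, is.numeric)  # TRUE for weeks and weight
is_ord <- sapply(toy_births, is.ordered)  # TRUE only for size
```

Storage type alone cannot settle the question: a numerical variable may be stored as a double even when it only takes whole-number values, so the continuous versus discrete decision needs a look at the data and its description.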

  1. Determine whether each variable in this dataset is numerical or categorical. Classify numerical variables further as continuous or discrete. Classify categorical variables further as ordinal or not ordinal.

Weights of babies

Let’s first take a look at the distribution of weights of babies. A histogram is a useful visualization for a numerical variable.

ggplot(data = nc, aes(x = weight)) +
  geom_histogram()

When you ran this code you likely got the following warning:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This warning is meant to serve as a reminder that default binwidths for histograms may not be ideal. We can customize the binwidth by providing this information as an argument in the geom_histogram function.

ggplot(data = nc, aes(x = weight)) +
  geom_histogram(binwidth = 1)

  1. Make histograms with binwidths 0.1, 1, and 5. Describe how these three histograms compare. Which of these appears to be the best representation of these data, and why?

We can also obtain relevant summary statistics for this variable using the summarise function:

nc %>%
  summarise(mean_wt = mean(weight), sd_wt = sd(weight), n = n())

Note that in the summarise function we created a list of three elements. The names of these elements (here mean_wt, sd_wt, and n) are user defined, so you can customize them as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Note that n() reports the sample size.
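
To illustrate that the names on the left-hand side are arbitrary, here is a minimal sketch on a made-up data frame, toy_births (hypothetical values, not the nc data). It computes the smallest and largest weight together with the sample size, using names chosen just for this example.

```r
library(dplyr)

# Hypothetical weights in pounds; made up for illustration only
toy_births <- data.frame(weight = c(5.5, 6.5, 7.0, 7.5, 8.5))

toy_summary <- toy_births %>%
  summarise(
    lightest = min(weight),  # any valid name works on the left-hand side
    heaviest = max(weight),
    n        = n()           # number of rows being summarised
  )

toy_summary  # a one-row data frame with columns lightest, heaviest, n
```

Any function that reduces a column to a single value can be used in the same way.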

Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:

  • mean
  • median
  • sd
  • IQR: Inter-quartile range, range of the middle 50% of the data
  • range
  • min
  • max

  1. Calculate the median weight of babies born as well as the IQR.

  2. Using the histogram you determined to be the best representation as well as relevant summary statistics you calculated so far, describe the distribution of weights of babies born. Make sure to mention the shape, center, and spread, as well as any unusual observations. Hint: If the distribution is symmetric, the mean and the standard deviation are useful measures of center and spread. If the distribution is skewed, the median and the IQR are better measures.

Smoking and baby weights

Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Side-by-side box plots are useful for visualizing the relationship between a numerical and a categorical variable.

ggplot(data = nc, aes(x = habit, y = weight)) +
  geom_boxplot()

  1. What does the plot highlight about weights of babies born to mothers who are smokers vs. mothers who are not smokers?

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions. In order to do this we need to first group the data by the habit variable, and then calculate the mean weight in these groups using the mean function.

nc %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))
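
The same group-then-summarise pattern can return more than one value per group. The sketch below uses a made-up data frame, toy_births (hypothetical values, not the nc data), and reports the group size next to each mean, which helps judge how much data each mean is based on.

```r
library(dplyr)

# Hypothetical data: two made-up babies per group
toy_births <- data.frame(
  habit  = c("nonsmoker", "nonsmoker", "smoker", "smoker"),
  weight = c(7.0, 8.0, 6.0, 7.0)
)

toy_by_habit <- toy_births %>%
  group_by(habit) %>%                # one group per distinct habit value
  summarise(
    mean_weight = mean(weight),      # mean within each group
    n           = n()                # number of rows in each group
  )

toy_by_habit  # one row per group: nonsmoker, then smoker
```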

Another option for visualizing these data is to make histograms for each group. An easy way of doing this is via faceting:

ggplot(data = nc, aes(x = weight)) +
  geom_histogram(binwidth = 1) +
  facet_grid(. ~ habit) # places histograms next to each other

At this point the observations that have NA values for habit might be bugging you. We could filter them out to obtain cleaner plots.

First we create a new dataset that omits all observations where habit is NA. We do this with the help of the filter function. The syntax for “is not NA” in R is !is.na().

nc_wo_nabitNA <- nc %>%
  filter(!is.na(habit))
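
To see what the filter condition evaluates to, the sketch below applies is.na() and its negation to a made-up data frame, toy_births (hypothetical values, not the nc data). filter() keeps exactly the rows where the condition is TRUE.

```r
library(dplyr)

# Hypothetical data with two missing habit values
toy_births <- data.frame(
  habit  = c("smoker", NA, "nonsmoker", NA),
  weight = c(6.2, 7.1, 7.8, 6.9)
)

habit_missing <- is.na(toy_births$habit)   # TRUE where habit is NA
habit_present <- !is.na(toy_births$habit)  # TRUE where habit is recorded

# Keep only the rows where habit is recorded
toy_wo_habitNA <- toy_births %>%
  filter(!is.na(habit))

nrow(toy_wo_habitNA)  # rows remaining after dropping missing habit
```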

Then, we use this new dataset in our plot.

ggplot(data = nc_wo_nabitNA, aes(x = weight)) +
  geom_histogram(binwidth = 1) +
  facet_grid(habit ~ .) # places histograms underneath each other

  1. Which faceting option (histograms next to each other or underneath each other) makes it easier to compare the centers of the histograms, and hence to answer the question of whether smoking is associated with lower or higher baby weights on average? Include the chosen visualization in your answer.

Father’s and mother’s age

We can visualize the relationship between two numerical variables using scatterplots:

ggplot(data = nc, aes(x = mage, y = fage)) +
  geom_point()

Note that this will give a warning about missing values. Since we don’t have age information on all fathers, these observations are removed from the visualization.

We summarize the strength of the linear relationship between these two variables with the correlation coefficient. The function for calculating the correlation coefficient is cor. In addition to the variables of interest, we must specify that we want to use only the observations for which we have complete data (no NAs) by setting use = "complete.obs".

cor(nc$fage, nc$mage, use = "complete.obs")

If we didn’t have any NAs in the variables of interest, we could omit the third argument above.
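
The effect of the third argument is easy to check on a small example. The sketch below uses two made-up vectors, toy_mage and toy_fage (hypothetical values, not the nc data), one of which contains an NA. It compares the default call, the call with use = "complete.obs", and a manual version that drops incomplete pairs first.

```r
# Hypothetical ages; the second father's age is missing
toy_mage <- c(20, 25, 30, 35)
toy_fage <- c(22, NA, 33, 36)

# Default behaviour: a single NA makes the result NA
cor_default <- cor(toy_fage, toy_mage)

# Use only the pairs where both values are present
cor_complete <- cor(toy_fage, toy_mage, use = "complete.obs")

# Same calculation done by hand: keep complete pairs, then correlate
keep <- complete.cases(toy_fage, toy_mage)
cor_manual <- cor(toy_fage[keep], toy_mage[keep])
```

Here cor_default is NA, while cor_complete and cor_manual agree because both are based on the same three complete pairs.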

Subgroup analysis

If we want to run any of the analyses we completed so far on a subgroup of the data, all we need to do is to filter the data first for the desired observations, and use the same code to replicate our analysis for this subgroup.

Suppose we want to evaluate the relationship between baby weights and smoking for teen mothers, that is, mothers below the age of 19.

First, create a dataset filtering for these mothers:

nc_teen_mom <- nc %>%
  filter(mage < 19)

  1. Create side-by-side box plots of baby weights and smoking for teen mothers. Hint: You can use the same code from above; you just need to refer to the dataset that contains information on just these mothers (nc_teen_mom), instead of the full dataset (nc).

  2. Calculate the average baby weights for teen mothers who are smokers and non-smokers.

Further exploration

For the following questions use the complete nc dataset, instead of the subgroup you created in the previous section.

  1. Using side-by-side boxplots or faceted histograms, visualize the relationship between smoking and gestational length (weeks). Provide a discussion of this relationship, and make sure to mention appropriate summary statistics in this discussion (e.g. medians for each group).

  2. Using a scatterplot, visualize the relationship between smoking and weight gained during pregnancy. Provide a discussion of this relationship, and make sure to mention appropriate summary statistics in this discussion.

  3. Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

  4. Pick a pair of variables: both numerical. Come up with a research question evaluating the relationship between these variables and use a visualization and appropriate summary statistics to answer this question. Make sure to not use the same combination of variables we used earlier, i.e. do not use mage and fage, but you can use one of them along with another variable.

  5. Pick a pair of variables: one numerical and one categorical. Come up with a research question evaluating the relationship between these variables and use a visualization and appropriate summary statistics to answer this question. Make sure to not use the same combination of variables we used earlier.


Submitting your work

Locate the files you want to export in the Files pane (lower right corner). These files are called

Check the box next to them, click on More -> Export, and then click on Download in the pop-up window.

Then, submit these as part of your Stats assignment 2 on Sakai. Due date is Monday, July 18.