In the following exercises we will reuse earlier datasets from the class and update and customize our visualizations to make them more informative and more visually appealing.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. We will also need the oilabs package for some of the datasets.

Let’s load the packages.

library(dplyr)
library(ggplot2)
library(oilabs)

Creating a reproducible report

We will be using a markdown language, R Markdown, to type up the report. This allows you to complete your data analysis entirely in RStudio and ensures the reproducibility of your analysis and results. To help get you started we are providing a template for you. Use the following code to download this template:

download.file("http://stat.duke.edu/~mc301/ARTSCI101_Su16/post/r/dataviz_customize_template.Rmd", destfile = "dataviz_customize.Rmd")

You will see a new file called dataviz_customize.Rmd in the Files tab on the pane in the bottom right corner of your RStudio window. We will refer to this as your “R markdown file” or “your report”. Click on the file name to open the file. All you need to do to complete the lab is to type up your brief answers and the R code (when necessary) in the spaces and chunks provided in the document.

Before you keep going, type your name and update the date, and then click on Knit HTML.

arbuthnot data

Load the data:

data(arbuthnot)

Remember, this dataset contains information on the number of girls and boys baptised in London between 1629 and 1710. The data were collected by Dr. Arbuthnot, hence the name of the dataset.

Change the background

Not a fan of the gray background? Use the black and white theme with theme_bw():

ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_line() +
  theme_bw()

Change axis limits

You can also change axis limits with xlim() and ylim():

ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_line() +
  theme_bw() +
  xlim(1629, 1710) +
  ylim(2500, 8000)
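
One thing to keep in mind: xlim() and ylim() set the limits of the scale, so any observations outside the range are treated as missing and dropped (with a warning). If you only want to zoom in on part of the plot without discarding data, coord_cartesian() is an alternative. The sketch below uses a small made-up data frame called toy (not one of the lab datasets) to contrast the two:

```r
library(ggplot2)

# Toy data, invented for illustration only (not from the lab datasets)
toy <- data.frame(x = 1:6, y = c(2, 4, 6, 8, 10, 12))

# ylim() sets the scale limits: the two points with y > 8 fall outside
# the range, are turned into missing values, and are dropped when drawn
p_limits <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point() +
  ylim(0, 8)

# coord_cartesian() zooms the view to the same range but keeps all the data
p_zoom <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 8))
```

For a scatterplot the two versions look much the same, but for lines or anything that summarizes the data the difference can matter, since dropped observations no longer contribute to what is drawn.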

Remember, last week we also learned how to change the axis labels:

ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_line() +
  theme_bw() +
  xlim(1629, 1710) +
  ylim(2500, 8000) +
  labs(x = "Year", y = "Number of girls baptised", 
       title = "Number of girls baptised between 1629 and 1710")

present dataset

Load the data:

data(present)

Remember, this dataset contains information on the number of girls and boys born in the present-day United States. The data come from census records.

  1. Plot the proportion of girls born over time in the present dataset. Adjust the x and y axis limits, relabel the axes, and add a title to the plot. You can use the gray (default) or the black and white theme; the decision is up to you. Hint: You will need to first create a new variable that records the proportion of girls born for each year (refer to previous instructions if you need a refresher on how to do that and work with your teammates), and then use this variable in your plot.
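
As a refresher on the hint, here is a minimal sketch of creating a proportion variable with mutate(). The data frame toy_births, its counts, and the new column name prop_girls are all made up for illustration; check the actual variable names in present before adapting it.

```r
library(dplyr)

# Invented counts for illustration only; these are NOT the values in present
toy_births <- data.frame(
  year  = c(2000, 2001, 2002),
  boys  = c(105, 102, 98),
  girls = c(95, 98, 102)
)

# mutate() adds a new column computed from existing ones
toy_births <- toy_births %>%
  mutate(prop_girls = girls / (boys + girls))

toy_births$prop_girls
```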

Resizing your plots

You might also be interested in resizing your plots in your report. You can do this by customizing your R chunk (as opposed to the R code inside the chunk).

  1. Copy and paste the R chunk from the previous exercise. Then, click on the gear icon on the first line of the R chunk. In the pop-up menu check the box for Use custom figure size and make your figure 4 inches wide by 3 inches tall. Then, click apply. You’ll see that this populates your R chunk definition with the following fig.height=3, fig.width=4. Next time you want to change your figure width you can either directly type code like this (with the measurements you desire) or use the pop-up options menu.
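
If you would rather not set the size on every chunk, one option is to set defaults once near the top of your report using the knitr package (the package R Markdown uses to run the chunks). The sketch below assumes you place it in an early chunk; the 4 and 3 are just the measurements from the exercise above.

```r
library(knitr)

# Set default figure dimensions (in inches) for every chunk that follows;
# an individual chunk can still override these in its own chunk definition
opts_chunk$set(fig.width = 4, fig.height = 3)
```

Any chunk that specifies its own fig.height or fig.width will use those values instead of the defaults.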

nc dataset

Load the data:

data(nc)

Remember, this dataset contains information on a random sample of 1000 births in North Carolina.

Conditioning by coloring and sizing

Suppose you want to evaluate the relationship between two numerical variables conditioned on a categorical variable. One option is to use a scatterplot where the points are colored according to the levels of the categorical variable.

Let’s plot weight gained vs. age of mother, conditioning on whether the mother was categorized as mature or younger.

ggplot(data = nc, aes(x = mage, y = gained, color = mature)) +
  geom_point()
## Warning: Removed 27 rows containing missing values (geom_point).

You can also customize the title of the legend. Note that we do this using the color argument in the labs() function since the title of the legend in this plot is determined by the variable that we are using to color the points.

ggplot(data = nc, aes(x = mage, y = gained, color = mature)) +
  geom_point() +
  labs(color = "Maturity status")
## Warning: Removed 27 rows containing missing values (geom_point).

  1. Include the plot we just described and describe what is wrong with it.

  2. Now make a more useful plot: Plot gained versus weight, conditioning on whether the mother was a smoker or not (color = habit). Also, customize the x and y axis labels, as well as the title of the legend.

Suppose we also want to see what part, if any, age of the mother plays in this relationship. Since we already assigned the color of the points to another variable, we need to use this variable in another way. For example, we can size the points according to the mother’s age.

ggplot(data = nc, aes(x = weight, y = gained, color = habit, size = mage)) +
  geom_point() +
  labs(color = "Smoking habit")
## Warning: Removed 27 rows containing missing values (geom_point).

This is causing a lot of overplotting, so we might also make the points somewhat transparent so that we can see where the data density is higher. We can do this by changing the alpha level of the points.

ggplot(data = nc, aes(x = weight, y = gained, color = habit, size = mage)) +
  geom_point(alpha = 0.5) +
  labs(color = "Smoking habit")
## Warning: Removed 27 rows containing missing values (geom_point).

Note that the alpha argument went in the geom_point() function while the habit and mage went in the aes() (aesthetics) function. Why? Because the alpha level (transparency) of the points is something we are “hardcoding” while the other aesthetic features of the data depend on specific variables in the dataset.
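
To see the difference between setting and mapping, here is a small sketch using a made-up data frame called toy (invented for illustration, not one of the lab datasets). In the first plot alpha is set to a fixed value; in the second it is placed inside aes() by mistake.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(x = 1:4, y = c(3, 1, 4, 2), group = c("a", "a", "b", "b"))

# Setting: alpha is a fixed value, so it goes outside aes()
p_set <- ggplot(data = toy, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.5)

# Mapping by mistake: inside aes(), 0.5 is treated like a variable,
# so ggplot2 adds a legend for alpha and its scale decides the
# transparency that is actually drawn
p_mapped <- ggplot(data = toy, aes(x = x, y = y, color = group)) +
  geom_point(aes(alpha = 0.5))
```

If an unexpected legend shows up in one of your plots, a fixed value sitting inside aes() is a likely culprit.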

  1. Make a plot of weight versus gained where the colors of the points are determined by habit and size of the points are determined by mage. Then, set the alpha level to 0.25 (even more transparent). Adjust the x and y axis labels, as well as the titles of the legends.

As you can see, all of this is doable. But is it a good idea to do it? Sizing the points by mother’s age doesn’t seem to tell us anything. In these exercises we’re teaching you some syntax (how to do things in R), but when you’re working on your project you will need to decide whether to do things. Don’t hesitate to try out a variety of plots, but don’t plan on including every single plot you try out in your final project. Always ask yourself whether the plot you made tells or reveals something meaningful about your data and your research question.

Conditioning by faceting

Another option for conditioning by a third or fourth variable is faceting.

ggplot(data = nc, aes(x = mage, y = gained)) +
  geom_point() +
  facet_grid(habit ~ marital)
## Warning: Removed 27 rows containing missing values (geom_point).

The NAs are getting in the way of the visualization, so we might first want to drop them:

nc_wo_na <- nc %>%
  filter(!is.na(marital)) %>%
  filter(!is.na(habit))
ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
  geom_point() +
  facet_grid(habit ~ marital)
## Warning: Removed 26 rows containing missing values (geom_point).
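
As an aside, filter() accepts several conditions at once and keeps only the rows that satisfy all of them, so the two filtering steps can also be written as one. The sketch below checks this on a made-up data frame called toy; its column names mirror nc, but the values are invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration only; column names mirror nc
toy <- data.frame(
  marital = c("married", NA, "not married", "married"),
  habit   = c("smoker", "nonsmoker", NA, "nonsmoker"),
  stringsAsFactors = FALSE
)

# Two chained filters, as in the code above
two_steps <- toy %>%
  filter(!is.na(marital)) %>%
  filter(!is.na(habit))

# One filter with both conditions; rows must satisfy all of them
one_step <- toy %>%
  filter(!is.na(marital), !is.na(habit))

# Compare the two results
all.equal(two_steps, one_step)
```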

Now that we have some data in each of the facets, let’s get back to the faceting syntax. Here we created a grid of facets where the horizontal axis of the grid (the columns) is determined by marital and the vertical axis of the grid (the rows) is determined by habit. Hence facet_grid(habit ~ marital), which follows the pattern (y ~ x).
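
If the formula is hard to remember, recent versions of ggplot2 also let you name the rows and columns explicitly. The sketch below uses the mpg dataset that ships with ggplot2 as a stand-in for nc, so the variables (drv, cyl, displ, hwy) are not from this lab; both plots should produce the same grid.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in here for nc
# Formula form: drv defines the rows, cyl defines the columns
p_formula <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

# Same layout with the arguments named explicitly
p_named <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))
```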

If we just wanted to facet by one variable and put the facets next to each other, we would only use the x side of the formula:

ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
  geom_point() +
  facet_grid(. ~ marital)
## Warning: Removed 26 rows containing missing values (geom_point).

Or if we just wanted to facet by one variable and put the facets underneath each other, we would only use the y side of the formula:

ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
  geom_point() +
  facet_grid(habit ~ .)
## Warning: Removed 26 rows containing missing values (geom_point).

bcrf dataset

To use this dataset, which is a CSV file, download it from http://stat.duke.edu/~mc301/data/bcrf.csv, upload it to RStudio, and load it with:

bcrf <- read.csv("bcrf.csv", stringsAsFactors = FALSE)

Remember that this dataset includes information on women who went through breast cancer screening in 2009.

  1. Create a visualization that involves three variables from this dataset, and comment on whether the visualization provides a meaningful insight into the data. If so, describe what it is.

  2. Create a visualization that involves four variables from this dataset, and comment on whether the visualization provides a meaningful insight into the data. If so, describe what it is.