In the following exercises we will reuse earlier datasets from the class and update and customize our visualizations to make them more informative and more visually appealing.
In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. We will also need the oilabs package for some of the datasets.
Let’s load the packages.
library(dplyr)
library(ggplot2)
library(oilabs)
We will be using a markdown language, R Markdown, to type up the report. This lets you complete your data analysis entirely in RStudio and helps ensure that your analysis and results are reproducible. To help you get started, we are providing a template. Use the following code to download it:
download.file("http://stat.duke.edu/~mc301/ARTSCI101_Su16/post/r/dataviz_customize_template.Rmd", destfile = "dataviz_customize.Rmd")
You will see a new file called dataviz_customize.Rmd in the Files tab in the pane in the bottom right corner of your RStudio window. We will refer to this as your “R Markdown file” or “your report”. Click on the file name to open the file. All you need to do to complete the lab is to type up your brief answers and the R code (when necessary) in the spaces and chunks provided in the document.
Before you keep going, type your name and update the date, and then click on Knit HTML.
arbuthnot data
Load the data:
data(arbuthnot)
Remember, this dataset contains information on the number of girls and boys baptised in London between 1629 and 1710. The data were collected by Dr. Arbuthnot, hence the name of the dataset.
Not a fan of the gray background? Use the black and white theme with theme_bw():
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
geom_line() +
theme_bw()
You can also change axis limits with xlim() and ylim():
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
geom_line() +
theme_bw() +
xlim(1629, 1710) +
ylim(2500, 8000)
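A side note that may be useful when choosing limits: xlim() and ylim() set the limits of the scale, so observations that fall outside the range are dropped (with a warning) rather than simply hidden. If you only want to zoom in, coord_cartesian() is one alternative. The sketch below uses a small made-up data frame (toy, with hypothetical values) rather than one of the lab datasets, so that it runs without oilabs.

```r
library(ggplot2)

# Hypothetical toy data standing in for a yearly count series
toy <- data.frame(
  year  = 1:10,
  count = c(5, 7, 6, 9, 12, 11, 15, 14, 18, 20)
)

# Scale limits: observations with year > 5 are treated as missing
p_scale <- ggplot(data = toy, aes(x = year, y = count)) +
  geom_point() +
  xlim(1, 5)

# Coordinate limits: every observation is kept, only the view is zoomed
p_zoom <- ggplot(data = toy, aes(x = year, y = count)) +
  geom_point() +
  coord_cartesian(xlim = c(1, 5))
```

For the arbuthnot plot above this makes no difference, because the chosen limits cover all of the data.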
Remember, last week we also learned how to change the axis labels:
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
geom_line() +
theme_bw() +
xlim(1629, 1710) +
ylim(2500, 8000) +
labs(x = "Year", y = "Number of girls baptised",
title = "Number of girls baptised between 1629 and 1710")
present dataset
Load the data:
data(present)
Remember, this dataset contains information on the number of girls and boys born in the present-day United States. The data come from census records.
Plot the proportion of girls born each year in the present dataset. Adjust the x and y axis limits, and rename the axes and the title of the plot. You can use the gray (default) or the black and white theme; the decision is up to you. Hint: You will need to first create a new variable that records the proportion of girls born for each year (refer to previous instructions if you need a refresher on how to do that and work with your teammates), and then use this variable in your plot.
You might also be interested in resizing your plots in your report. You can do this by customizing the options of your R chunk (as opposed to the R code inside the chunk), for example:
fig.height=3, fig.width=4
Next time you want to change your figure size you can either directly type code like this (with the measurements you desire) or use the pop-up options menu.
nc dataset
Load the data:
data(nc)
Remember, this dataset contains information on a random sample of 1000 births in North Carolina.
Suppose you want to evaluate the relationship between two numerical variables conditioned on a categorical variable. One option is to use a scatterplot where the points are colored according to the levels of the categorical variable.
Let’s plot weight gained vs. age of mother, conditioning on whether the mother was categorized as mature or younger.
ggplot(data = nc, aes(x = mage, y = gained, color = mature)) +
geom_point()
## Warning: Removed 27 rows containing missing values (geom_point).
You can also customize the title of the legend. Note that we do this using the color argument in the labs() function, since the title of the legend in this plot is determined by the variable that we are using to color the points:
ggplot(data = nc, aes(x = mage, y = gained, color = mature)) +
geom_point() +
labs(color = "Maturity status")
## Warning: Removed 27 rows containing missing values (geom_point).
Include the plot we just described and describe what is wrong with it.
Now make a more useful plot: Plot gained versus weight, conditioning on whether the mother was a smoker or not (color = habit). Also, customize the x and y axis labels, as well as the title of the legend.
Suppose we also want to see what part, if any, the age of the mother plays in this relationship. Since we already assigned the color of the points to another variable, we need to bring in the mother’s age in another way. For example, we can size the points according to the mother’s age.
ggplot(data = nc, aes(x = weight, y = gained, color = habit, size = mage)) +
geom_point() +
labs(color = "Smoking habit")
## Warning: Removed 27 rows containing missing values (geom_point).
This is causing a lot of overplotting, so we might also make the points somewhat transparent so that we can see where the data density is higher. We can do this by changing the alpha level of the points.
ggplot(data = nc, aes(x = weight, y = gained, color = habit, size = mage)) +
geom_point(alpha = 0.5) +
labs(color = "Smoking habit")
## Warning: Removed 27 rows containing missing values (geom_point).
Note that the alpha argument went in the geom_point() function while habit and mage went in the aes() (aesthetics) function. Why? Because the alpha level (transparency) of the points is something we are “hardcoding” while the other aesthetic features of the data depend on specific variables in the dataset.
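The difference between mapping an aesthetic to a variable and hardcoding it can be sketched as follows. This is only an illustration: it uses the built-in mtcars data as a stand-in for nc (an assumption made so the example runs without oilabs), and the color "steelblue" is an arbitrary choice.

```r
library(ggplot2)

# Mapped: color depends on a variable, so it goes inside aes()
# and ggplot2 draws a legend for it
p_mapped <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(alpha = 0.5)

# Set: color and alpha are constants, so they go outside aes(),
# directly in geom_point(), and no legend is drawn for them
p_set <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue", alpha = 0.5)
```

A rule of thumb: if the value comes from a column of the data, put it in aes(); if it is a fixed value you choose yourself, put it in the geom.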
Plot weight versus gained where the colors of the points are determined by habit and the sizes of the points are determined by mage. Then, set the alpha level to 0.25 (even more transparent). Adjust the x and y axis labels, as well as the titles of the legends.
As you can see, all of this is doable. But is it a good idea to do it? Sizing the points by mother’s age doesn’t seem to tell us anything. In these exercises we’re teaching you some syntax (how to do things in R), but when you’re working on your project you will need to decide whether to do things. Don’t hesitate to try out a variety of plots, but don’t plan on including every single plot you try out in your final project. Always ask yourself whether the plot you made tells or reveals something meaningful about your data and your research question.
Another option for conditioning by a third or fourth variable is faceting.
ggplot(data = nc, aes(x = mage, y = gained)) +
geom_point() +
facet_grid(habit ~ marital)
## Warning: Removed 27 rows containing missing values (geom_point).
The NAs are getting in the way of the visualization, so we might first want to drop them:
nc_wo_na <- nc %>%
filter(!is.na(marital)) %>%
filter(!is.na(habit))
ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
geom_point() +
facet_grid(habit ~ marital)
## Warning: Removed 26 rows containing missing values (geom_point).
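The two filter() calls above could also be written as a single call, since filter() keeps only the rows that satisfy all of the conditions it is given. The sketch below uses a small made-up data frame (toy, hypothetical values) in place of nc. It also shows that a missing value in a variable we did not filter on stays in the data, which is likely why ggplot2 still warns about removed rows in the plot above.

```r
library(dplyr)

# Hypothetical toy data with missing values in different columns
toy <- data.frame(
  habit   = c("smoker", NA, "nonsmoker", "nonsmoker"),
  marital = c("married", "married", NA, "not married"),
  gained  = c(30, 25, 28, NA)
)

# One filter() call with two conditions, equivalent to chaining two calls
toy_wo_na <- toy %>%
  filter(!is.na(habit), !is.na(marital))

nrow(toy_wo_na)              # 2 rows remain
sum(is.na(toy_wo_na$gained)) # 1 missing value in gained is still there
```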
Now that we have some data in each of the facets, let’s get back to the faceting syntax. Here we created a grid of facets where the horizontal axis of the grid is determined by marital and the vertical axis of the grid is determined by habit. Hence facet_grid(habit ~ marital), that is, (y ~ x).
If we just wanted to facet by one variable and put the facets next to each other, we would only use the x side of the formula:
ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
geom_point() +
facet_grid(. ~ marital)
## Warning: Removed 26 rows containing missing values (geom_point).
Or if we just wanted to facet by one variable and put the facets underneath each other, we would only use the y side of the formula:
ggplot(data = nc_wo_na, aes(x = weight, y = gained)) +
geom_point() +
facet_grid(habit ~ .)
## Warning: Removed 26 rows containing missing values (geom_point).
bcrf dataset
To use this dataset, which is a CSV file, download it from http://stat.duke.edu/~mc301/data/bcrf.csv, upload it to RStudio, and load it with:
bcrf <- read.csv("bcrf.csv", stringsAsFactors = FALSE)
Remember that this dataset includes information on women who went through breast cancer screening in 2009.
Create a visualization that involves three variables from this dataset, and comment on whether the visualization provides a meaningful insight into the data. If so, describe what it is.
Create a visualization that involves four variables from this dataset, and comment on whether the visualization provides a meaningful insight into the data. If so, describe what it is.