In the following exercises we will learn to visualize and summarize relationships between categorical variables.

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package.

Let’s load the packages.

library(dplyr)
library(ggplot2)

Creating a reproducible report

We will use R Markdown, a markdown language, to type up the report. This allows you to complete your data analysis entirely in RStudio and ensures the reproducibility of your analysis and results. To help you get started, we are providing a template. Use the following code to download it:

download.file("http://stat.duke.edu/~mc301/ARTSCI101_Su16/post/r/bc_risk_factors_template.Rmd", destfile = "bc_risk_factors.Rmd")

You will see a new file called bc_risk_factors.Rmd in the Files tab of the pane in the bottom right corner of your RStudio window. We will refer to this as your “R Markdown file” or “your report”. Click on the file name to open the file. To complete the lab, type your brief answers and the R code (when necessary) in the spaces and chunks provided in the document.

Before you keep going, type your team name, and then click on Knit HTML. You’ll see your compiled document in a new pop-up window.

The data

The Breast Cancer Surveillance Consortium (BCSC) is a research resource for studies designed to assess the delivery and quality of breast cancer screening and related patient outcomes in the United States.

The BCSC releases a variety of datasets for public use. We will work with one of these: the risk factors dataset. The original dataset includes information from millions of mammography examinations performed within the BCSC between January 2000 and December 2009, but we will work with a random sample of 5000 observations from 2009 only. The BCSC notes the following about the dataset.

This dataset may be useful to people interested in exploring the distribution of breast cancer risk factors in US women. The dataset includes participant characteristics that have been previously shown to be associated with breast cancer risk, including age, race/ethnicity, family history of breast cancer, age at menarche, age at first birth, breast density, use of hormone replacement therapy, menopausal status, BMI, history of biopsy, and history of breast cancer. These data can be used to provide information on the distribution of breast cancer risk in the general population or to explore relationships among breast cancer risk factors. See the Risk Factors Dataset Documentation for more information about the variables included in the dataset. (Source: http://breastscreening.cancer.gov/data/rf/)

We are going to use a new way of loading a dataset.

bcrf <- read.csv("bcrf.csv", stringsAsFactors = FALSE)
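
The effect of stringsAsFactors = FALSE can be checked on a small stand-in file. The sketch below writes a made-up three-row CSV to a temporary location (the file and its rows are hypothetical, not the lab's bcrf.csv) and reads it back the same way.

```r
# Write a tiny made-up CSV to a temporary file (this is NOT the real bcrf.csv).
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("year,first_degree_hx",
    "2009,No",
    "2009,Yes",
    "2009,Unknown"),
  toy_path
)

# stringsAsFactors = FALSE keeps text columns as character vectors,
# so we can decide later which level order each factor should have.
toy <- read.csv(toy_path, stringsAsFactors = FALSE)

class(toy$first_degree_hx)  # "character"
class(toy$year)             # "integer"
```

In R 4.0.0 and later this is already the default behavior of read.csv(), so the argument mainly matters on older versions of R.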

We have observations on 12 different variables, some categorical and some numerical. The meaning of each variable is as follows.

| variable | description |
|---|---|
| year | 2009 (all observations in this dataset are from 2009) |
| age | Age group |
| race_eth | Race/ethnicity |
| first_degree_hx | History of breast cancer in a first degree relative |
| age_menarche | Age (years) at menarche |
| age_first_birth | Age (years) at first birth |
| BIRADS_breast_density | BI-RADS breast density |
| current_hrt | Use of hormone replacement therapy |
| menopaus | Menopausal status |
| bmi_group | Body mass index |
| biophx | Previous breast biopsy or aspiration |
| breast_cancer_history | Prior breast cancer diagnosis |

  1. What are the cases in this data set? How many cases are there in our sample?

You can answer this question by viewing the data in the data viewer or by using the following command:

str(bcrf)

As you review the variables, consider which variables are categorical and which are numerical.
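
Besides str(), a few base R functions report the size of a data frame and how each column is stored. The sketch below runs them on a small made-up data frame (toy and its values are hypothetical stand-ins for bcrf).

```r
# A made-up stand-in for bcrf with one row per case.
toy <- data.frame(
  year = c(2009L, 2009L, 2009L, 2009L),
  age = c("40-44", "45-49", "45-49", "60-64"),
  first_degree_hx = c("No", "Yes", "No", "Unknown"),
  stringsAsFactors = FALSE
)

nrow(toy)           # number of cases (rows)
ncol(toy)           # number of variables (columns)
sapply(toy, class)  # how each column is stored
```

Keep in mind that the storage type is only a hint: a column stored as character or integer can still be an ordinal categorical variable, so the classification has to come from what the values mean.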

  1. Determine whether each variable in this dataset is numerical or categorical. Classify numerical variables further as continuous or discrete. Classify categorical variables further as ordinal or not ordinal.

History of breast cancer in a first degree relative

Let’s first take a look at the distribution of history of breast cancer in a first degree relative (first_degree_hx). A bar plot is a useful visualization for a single categorical variable.

ggplot(data = bcrf, aes(x = first_degree_hx)) +
  geom_bar()

Now we have a bar plot, but the order of the bars does not really make sense. Note that R, by default, orders the bars in alphabetical order of the levels. We might want a more meaningful order like No, Yes, Unknown. To do this we first need to change the ordering of the levels in the data, and then re-plot.

bcrf <- bcrf %>%
  mutate(first_degree_hx = factor(first_degree_hx, levels = c("No", "Yes", "Unknown")))
ggplot(data = bcrf, aes(x = first_degree_hx)) +
  geom_bar()
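
To see what factor() with an explicit levels argument does, here is a minimal sketch on a made-up character vector (x is hypothetical and includes a deliberate typo). It also shows a pitfall worth checking for: any value not listed in levels becomes NA.

```r
# Made-up values; the last one is a deliberate typo (lowercase "unknown").
x <- c("Yes", "No", "Unknown", "No", "unknown")

# Default: levels are sorted alphabetically.
levels(factor(x))

# Explicit levels set the order used by plots and tables.
f <- factor(x, levels = c("No", "Yes", "Unknown"))
levels(f)

# Values missing from `levels` are silently turned into NA,
# so it is worth counting NAs after reordering.
sum(is.na(f))  # 1
```

If the NA count is larger than expected after reordering, the spelling of the levels probably does not match the values in the data.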

Looking better. Another optional adjustment that would improve the plot would be to update the x-axis label.

ggplot(data = bcrf, aes(x = first_degree_hx)) +
  geom_bar() +
  labs(x = "History of breast cancer in a first degree relative")

  1. Re-plot the bar plot, making sure the levels of first_degree_hx are shown in the order No, Yes, Unknown, the x-axis label reads “History of breast cancer in a first degree relative”, and the y-axis label reads “Count” (with a capital C). Hint: You just need to add a new argument to the labs() layer.

To summarize these data numerically we use a frequency table.

bcrf %>%
  select(first_degree_hx) %>%
  table()

We can also easily obtain the relative frequencies.

bcrf %>%
  select(first_degree_hx) %>%
  table() %>%
  prop.table()
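
The same counts and relative frequencies can also be kept in a data frame, which is sometimes easier to read or reuse. The following is a sketch using dplyr's count() and mutate() on made-up values (toy is a hypothetical stand-in for bcrf).

```r
library(dplyr)

# Made-up values standing in for bcrf$first_degree_hx.
toy <- data.frame(
  first_degree_hx = factor(
    c("No", "No", "No", "Yes", "Unknown"),
    levels = c("No", "Yes", "Unknown")
  )
)

# count() gives one row per level; mutate() adds the share of each level.
freq <- toy %>%
  count(first_degree_hx) %>%
  mutate(prop = n / sum(n))

freq
```

The prop column should always sum to 1, which is a quick sanity check on the result.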

History of breast cancer in a first degree relative and prior breast cancer diagnosis

Next we explore the relationship between history of breast cancer in a first degree relative (first_degree_hx) and prior breast cancer diagnosis (breast_cancer_history). We can do this using segmented bar plots.

ggplot(data = bcrf, aes(x = first_degree_hx, fill = breast_cancer_history)) +
  geom_bar()

  1. Reorder the levels of the breast_cancer_history variable as No, Yes, Unknown, and re-plot this segmented bar plot.

This segmented bar plot tells us about the frequency distribution of prior breast cancer diagnosis conditional on history of breast cancer in a first degree relative. However, if we want to evaluate the relationship between these two variables, it would be more useful to visualize the conditional probabilities (as opposed to frequencies).

ggplot(data = bcrf, aes(x = first_degree_hx, fill = breast_cancer_history)) +
  geom_bar(position = "fill")

We can make our visualization a bit tidier by updating the legend title:

ggplot(data = bcrf, aes(x = first_degree_hx, fill = breast_cancer_history)) +
  geom_bar(position = "fill") +
  labs(fill = "Prior breast cancer diagnosis")

  1. Re-plot the segmented bar plot showing the conditional probabilities with better x- and y-axis labels. For the x-axis you can use the same label we used before. For the y-axis use “Relative frequency”. Then, describe the relationship between the two variables, making sure to comment on whether a prior breast cancer diagnosis is more likely for patients with or without a history of breast cancer in a first degree relative.

To summarize these data numerically we use a contingency table.

bcrf %>%
  select(first_degree_hx, breast_cancer_history) %>%
  table()

  1. Calculate the probabilities of having a prior breast cancer diagnosis for patients with and without a history of breast cancer in a first degree relative. Include the contingency table in your answer.

We can also ask R to calculate these probabilities for us.
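
One way to do this is prop.table() with its margin argument. The sketch below uses made-up data (toy is a hypothetical stand-in for the two bcrf columns) and conditions on the rows, that is, on first_degree_hx.

```r
# Made-up data standing in for the two bcrf columns.
toy <- data.frame(
  first_degree_hx = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  breast_cancer_history = c("No", "No", "No", "Yes", "No", "No", "Yes", "Yes")
)

tab <- table(toy$first_degree_hx, toy$breast_cancer_history,
             dnn = c("first_degree_hx", "breast_cancer_history"))

# margin = 1 divides each row by its row total, so each row holds the
# distribution of breast_cancer_history given that level of first_degree_hx.
cond <- prop.table(tab, margin = 1)
cond
```

In this toy table the proportion with a prior diagnosis is 0.25 in the "No" row and 0.5 in the "Yes" row. On the lab data, adding prop.table(margin = 1) as a final step after table() in the pipeline above should give the corresponding row-wise proportions.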

Independence: If the conditional probabilities of one variable vary across the levels of another variable, the two variables might be dependent. In other words, there is a relationship (association) between them. If the conditional probabilities do not vary across the levels of the other variable, the two variables are likely independent. In other words, there is no association between them.
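
The definition above can be illustrated with two made-up tables of counts, one built so that the conditional proportions match exactly and one where they differ. All names and numbers here are hypothetical.

```r
# Two made-up 2x2 tables of counts; rows are levels of A, columns levels of B.
labels <- list(A = c("No", "Yes"), B = c("No", "Yes"))
indep <- matrix(c(80, 40, 20, 10), nrow = 2, dimnames = labels)
assoc <- matrix(c(80, 30, 20, 20), nrow = 2, dimnames = labels)

# Conditional proportions of B within each level of A (each row sums to 1).
p_indep <- prop.table(indep, margin = 1)
p_assoc <- prop.table(assoc, margin = 1)

p_indep  # both rows are 0.8 / 0.2: no sign of association
p_assoc  # rows are 0.8 / 0.2 and 0.6 / 0.4: B varies with A
```

In a real sample the rows will rarely match exactly even when the variables are independent, so small differences may simply reflect sampling variability.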

  1. Do these data suggest an association between history of breast cancer in a first degree relative and prior breast cancer diagnosis? Support your answer with the relevant (conditional) probabilities.

Exploring three-way relationships

We can easily add more variables into our exploration.

Suppose we want to know how (if at all) the relationship between history of breast cancer in a first degree relative and prior breast cancer diagnosis changes depending on whether the woman is using hormone replacement therapy.

Before we plot the data, let’s first reorder the levels of the hormone replacement therapy variable in a meaningful way.

bcrf <- bcrf %>%
  mutate(current_hrt = factor(current_hrt, levels = c("No", "Yes", "Unknown")))

We can plot all three variables at once with the help of faceting.

ggplot(data = bcrf, aes(x = first_degree_hx, fill = breast_cancer_history)) +
  geom_bar(position = "fill") +
  facet_wrap(~ current_hrt) +
  labs(title = "Use of hormone replacement therapy")

  1. Re-plot the faceted segmented bar plots with better x- and y-axis labels as well as a better label for the legend. Then, comment on whether the relationship between history of breast cancer in a first degree relative and prior breast cancer diagnosis changes depending on whether the woman is using hormone replacement therapy.
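
A numerical counterpart to the faceted plot is a three-way table of conditional proportions. The sketch below builds one from made-up counts (toy and the column n are hypothetical) and conditions on both first_degree_hx and current_hrt, so it mirrors what each bar in each facet displays.

```r
# Made-up counts for every combination of three two-level variables.
toy <- expand.grid(
  first_degree_hx = c("No", "Yes"),
  breast_cancer_history = c("No", "Yes"),
  current_hrt = c("No", "Yes"),
  stringsAsFactors = FALSE
)
toy$n <- c(40, 10, 5, 5, 20, 8, 2, 4)

# Three-way contingency table of the counts.
tab3 <- xtabs(n ~ first_degree_hx + breast_cancer_history + current_hrt,
              data = toy)

# margin = c(1, 3): within each combination of first_degree_hx (dimension 1)
# and current_hrt (dimension 3), the breast_cancer_history proportions sum to 1.
cond3 <- prop.table(tab3, margin = c(1, 3))
round(cond3, 2)
```

Each slice of the output corresponds to one facet, and each row within a slice corresponds to one bar. If a combination has no observations, its proportions come out as NaN, which is worth keeping in mind with sparse data.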

Further exploration

For the following questions use the complete bcrf dataset.

  1. Pick a single categorical variable, make a bar plot of this variable, make sure the levels of the variable are ordered in a meaningful way, and adjust the labels of the x- and y-axes to show descriptions of variables in a case-matching format. Also describe the distribution of the variable using values from a relative frequency table of this variable.

  2. Pick two categorical variables, make a segmented bar plot of these variables (one that displays the conditional probabilities), make sure the levels of the variables are ordered in a meaningful way, and adjust the labels of the x- and y-axes to show descriptions of variables in a case-matching format. Also comment on whether these variables appear to be associated or not using the relevant conditional probabilities from a contingency table.

  3. Pick a third categorical variable, and re-construct the visualization from the previous exercise faceted by the levels of this variable. Make sure that all of your labels are meaningful (as opposed to just variable names from the dataset). Also comment on whether the relationship between the two variables you selected in the previous exercise appears to vary across the levels of this third variable.


Submitting your work

Locate the files you want to export in the Files pane (lower right corner). These files are called

Check the box next to them, click on More -> Export, and then click on Download in the pop-up window.

Then, submit these as part of your Stats assignment 3 on Sakai. The due date is Monday, July 25.