Lab 05 - Simpson’s paradox

2018-02-15

Due: 2018-02-22 at noon

Introduction

A study conducted in Whickham, England recorded participants’ age and smoking status at baseline, and then, 20 years later, recorded their health outcome.

Getting started

Packages

In this lab we will work with the tidyverse and mosaicData packages, so we need to install and load them:

# Install the packages (only needed once)
install.packages("tidyverse")
install.packages("mosaicData")

# Load the packages (needed in every new R session)
library(tidyverse)
library(mosaicData)

Note that these packages are also loaded in your R Markdown document.
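
Running install.packages() again re-downloads a package that is already present, so it can help to check first which packages are missing. The following is a minimal sketch, not a required step of the lab. The helper name missing_packages is hypothetical and introduced here only for illustration, and the CRAN mirror URL is one reasonable choice among several.

```r
# Hypothetical helper (not part of the lab): return the packages in `pkgs`
# that cannot be found in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing, then load both packages.
to_install <- missing_packages(c("tidyverse", "mosaicData"))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
library(tidyverse)
library(mosaicData)
```

If both packages are already installed, to_install is empty and nothing is downloaded.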

Housekeeping

Git configuration

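Configure Git so that your commits are attributed to you by typing the following in the Terminal. Your email address should be the one tied to your GitHub account, and your name should be your first and last name.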

git config --global user.email "your email"
git config --global user.name "your name"

To confirm that the changes have been implemented, run the following:

git config --global user.email
git config --global user.name

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'
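
The timeout value is given in seconds. As a quick check in the R console (a sketch only; timeout_seconds is an illustrative variable name), one week works out to the number used above:

```r
# Seconds in one week: 7 days x 24 hours x 60 minutes x 60 seconds
timeout_seconds <- 7 * 24 * 60 * 60
timeout_seconds
```

To cache the password for a different period, the same arithmetic gives the value to use in place of 604800.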

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 05 - Simpson’s paradox”.

Warm up

Pick one team member to complete the steps in this section while the others contribute to the discussion but do not actually touch the files on their computer.

Before we introduce the data, let’s warm up with some simple exercises.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your team name, and knit the document.

Committing and pushing changes:

Pulling changes:

Now the remaining team members, who have not been making these changes in their own projects, should click the Pull button in their Git pane and confirm that the changes now appear in their projects as well.

The data

The data is in the mosaicData package. You can load it with

data(Whickham)

Take a peek at the codebook with

?Whickham

or at https://www.rdocumentation.org/packages/mosaicData/versions/0.14.0/topics/Whickham.
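
Once the data are loaded, a quick look at their structure can help with the first few exercises. This is a sketch of one way to do it, assuming the tidyverse and mosaicData packages are installed as described earlier; glimpse() comes from dplyr, which is loaded with the tidyverse.

```r
library(tidyverse)
library(mosaicData)

data(Whickham)

# glimpse() lists each variable with its type and first few values
glimpse(Whickham)

# Number of observations (rows) and variables (columns)
nrow(Whickham)
ncol(Whickham)
```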

Exercises

  1. What type of study do you think these data come from: an observational study or an experiment? Why?

  2. How many observations are in this dataset? What does each observation represent?

  3. How many variables are in this dataset? What type of variable is each? Display each variable using an appropriate visualization.

  4. What would you expect the relationship between smoking status and health outcome to be?

  5. Create a visualization depicting the relationship between smoking status and health outcome. Briefly describe the relationship, and evaluate whether this meets your expectations. Additionally, calculate the relevant conditional probabilities to help your narrative. Here is some code to get you started:

Whickham %>%
  count(smoker, outcome)

  6. Create a new variable called age_cat using the following scheme:

  7. Re-create the visualization depicting the relationship between smoking status and health outcome, faceted by age_cat. What changed? What might explain this change? Extend the contingency table from earlier by breaking it down by age category and use it to help your narrative.

Whickham %>%
  count(smoker, age_cat, outcome)
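
The smoker-by-outcome counts from Exercise 5 can be turned into conditional probabilities by dividing each count by the total within its smoking group. The following is a sketch of one possible approach, not the only one. The object name outcome_by_smoker and the column name prop are illustrative choices, not names the lab requires.

```r
library(tidyverse)
library(mosaicData)

data(Whickham)

# P(outcome | smoker): each count divided by the total within its smoker group
outcome_by_smoker <- Whickham %>%
  count(smoker, outcome) %>%
  group_by(smoker) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

outcome_by_smoker

# Bars scaled to height 1 show the same conditional proportions visually
ggplot(Whickham, aes(x = smoker, fill = outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
```

Once age_cat has been created, the same pattern should extend to the age-stratified table by counting and grouping on both smoker and age_cat, and the plot can be split by age group by adding facet_wrap(~ age_cat).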