HW 02 - Confounding and Simpson’s Paradox

Due: Thursday, Feb 13 at 11:59pm

This homework is based on a study conducted in Whickham, England, a mixed urban and rural district near Newcastle-upon-Tyne. The survey was conducted in the 1970s to study heart and thyroid disease, and a follow-up was conducted 20 years later.

As you work on this assignment, don’t forget to give your code chunks meaningful names and to follow the good coding conventions described during lecture and lab (part of your grade depends on whether you follow them).

Getting Started

To accept this assignment, click here: https://classroom.github.com/a/PARuVkbT. Navigate to your repository beginning with hw2- and clone it into RStudio Cloud. Configure git by using the use_git_config() function in the usethis package, and finally, give your project a meaningful name (hw2-[name]). You may cache your login credentials, but remember that they are stored only for this single project.
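As a minimal sketch of the git configuration step, the call below sets the name and email that git attaches to your commits. The name and email shown are placeholders (assumptions, not values from the course); replace them with the ones tied to your GitHub account.

```r
library(usethis)

# Placeholder values: substitute your own name and GitHub email address
use_git_config(
  user.name  = "Your Name",
  user.email = "your.email@example.com"
)
```

Run this once in the console rather than inside your R Markdown document, since it changes configuration rather than producing analysis output.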

Creating a README file in GitHub

This homework assignment repository contains a template README file that you can edit. There are two ways that you can edit the README.

As part of this assignment, create a meaningful README file in your repository (you may want to save this step for last).

Packages and data

In this assignment we will work with the tidyverse and mosaicData packages. The data set we will be using is called Whickham and can be loaded into memory after loading the mosaicData package. Below is some code to get you started.
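A minimal sketch of the setup, assuming both packages are already installed in your RStudio Cloud project. Whickham becomes available as soon as mosaicData is loaded; glimpse() is one convenient way to get a first look at it.

```r
library(tidyverse)
library(mosaicData)

# Whickham is available once mosaicData is loaded;
# glimpse() prints a compact overview of rows, columns, and types
glimpse(Whickham)
```

You can also run ?Whickham in the console to open the help documentation for the data set.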



  1. Do you think these data come from an observational study or an experiment? Why? Check the help documentation for this data set (?Whickham) before answering.

  2. How many observations are in this data set, and what does each observation represent?

  3. How many variables are in this data set, and what R data type is each variable?

  4. What would you expect the relationship between smoking status and health outcome to be? Which is the explanatory variable, and which is the response variable?

  5. Create a visualization depicting the relationship between smoking status and health outcome. Briefly describe this relationship and whether it matches the expectations you stated in Exercise 4. In doing so, calculate the relevant conditional probabilities to support your narrative. Here is some code to get you started:
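One possible starting point, offered as a sketch rather than a complete answer: count each combination of smoker and outcome, then convert the counts to proportions within each smoking status. The object name smoker_outcome and the column name prop are illustrative choices, not names required by the assignment.

```r
library(tidyverse)
library(mosaicData)

# Counts for each combination of smoking status and health outcome
smoker_outcome <- Whickham %>%
  count(smoker, outcome)

# Conditional proportions: share of each outcome within each smoking status
smoker_outcome <- smoker_outcome %>%
  group_by(smoker) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

smoker_outcome
```

Because the proportions are computed within each level of smoker, they sum to 1 for smokers and to 1 for non-smokers, which is what makes them conditional on smoking status.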

  6. How many smokers and non-smokers at the time of the survey were between 45 and 64 years of age inclusive? Display your result as a data frame with variable smoker and the corresponding counts for each smoker level.

  7. Create a new variable which categorizes age based on the following scheme:

    • age 44 or younger: “18-44”
    • age 45 to 64, inclusive: “45-64”
    • age 65 or older: “65+”

  8. Consider this new variable as a possible explanation of the relationship you observed in Exercise 5. Once you account for age category, what has changed, and what might explain this change? Create a data visualization to support your narrative.
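For Exercises 7 and 8, the sketch below shows one way to build the age categories with case_when() and then compare outcomes within each category. The names whickham_age and age_cat are hypothetical, and the faceted bar chart is only one of several reasonable visualizations.

```r
library(tidyverse)
library(mosaicData)

# case_when() checks conditions in order, so the second condition
# only applies to ages that were not already matched by the first
whickham_age <- Whickham %>%
  mutate(age_cat = case_when(
    age <= 44 ~ "18-44",
    age <= 64 ~ "45-64",
    age >= 65 ~ "65+"
  ))

# Proportion of each outcome by smoking status, one panel per age category
ggplot(whickham_age, aes(x = smoker, fill = outcome)) +
  geom_bar(position = "fill") +
  facet_wrap(~ age_cat) +
  labs(x = "Smoker", y = "Proportion", fill = "Outcome")
```

Storing the result in a new object leaves the original Whickham data untouched, so the earlier exercises can still be rerun as written.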


Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please upload only your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is located. If an answer spans multiple pages, mark all of those pages. Make sure to associate the “Overall” section with the first page.