Homework 01: Data visualization

Due: Friday, February 5, at 11:59pm ET

Goals

Clone assignment repo and start new project

Packages

In this assignment, we will work with the tidyverse package.

library(tidyverse)

Diamond prices

In this assignment, you will investigate diamond prices using a sample of 1,000 diamonds. Build effective and well-labeled visualizations to answer the questions below. For each question, show your code and output, and write your answers in complete sentences.

All plots should follow the best visualization practices discussed in lecture. Plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

In addition, no line of code or narrative should exceed 80 characters.
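As a rough illustration of the labeling and line-length guidelines above, the sketch below breaks a plot definition across several short lines and sets the title and axis labels with labs(). It deliberately uses the mpg dataset that ships with ggplot2 rather than the assignment data, and the object name labeled_plot and the label text are made up for this example.

```r
library(tidyverse)

# Illustrative only: mpg (bundled with ggplot2) stands in for the
# assignment data, so this is not a solution to any exercise.
# One layer or argument per line keeps every line under 80 characters.
labeled_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Highway mileage versus engine displacement",
    x = "Engine displacement (liters)",
    y = "Highway mileage (miles per gallon)"
  )

labeled_plot
```

Ending each line with + or %>% lets R know the expression continues, which makes it easy to wrap long calls without changing what they do.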

For this assignment you must have at least three commits and all of your code chunks must have meaningful names. You may choose when you want to make your commits.

We will only examine a subset of the data, so include the code below in a code chunk at the start of your R Markdown file.

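# Fix the random seed so that everyone draws the same sample.
set.seed(1)
# Keep diamonds of at most 2.5 carats, then sample 1,000 of them.
diamonds_subset <- diamonds %>%
  filter(carat <= 2.5) %>%
  sample_n(1000)

  1. How many rows are in the diamonds_subset dataset? How many columns?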

  2. Examine the documentation of the diamonds dataset by running ?diamonds in the console. What is the meaning of clarity? What is the worst clarity? What is the meaning of color? What is the best color? Note: we are investigating only a subset of the data (1,000 diamonds rather than the full set of more than 50,000).

  3. Construct a scatterplot of price versus carat. Describe the relationship.

  4. Color the points in the price versus carat scatterplot by the diamond’s color. Describe the relationship.

  5. Add a geom_smooth() for each color, and set the argument se = FALSE to omit the confidence bands surrounding the smoothed fit.

  6. Examine the relationship between price and carat by clarity, using a separate scatterplot for each clarity.

  7. Create a bar chart showing all of the colors, with the count of diamonds on the y-axis.

  8. Create a segmented bar chart showing one bar per color, each bar spanning 0 to 1, with the fill determined by cut.

  9. Create a segmented bar chart showing one bar per color, each bar spanning 0 to 1, with the fill determined by price. Does this plot work? Why or why not?

  10. Create side-by-side boxplots of price for each color and comment on the relationship. Then construct a violin plot using geom_violin(). What do the violin plots reveal that boxplots do not? What do boxplots reveal that violin plots do not?

  11. Come up with a research question based on these data and write it down. Then, create an effective data visualization that answers the question and write a brief paragraph explaining how your visualization answers the question. Your plot should be substantially and noticeably different from the plots you created above. Do not simply switch variables or make a minor modification. Be creative and have fun!
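The exercises above rely on a handful of ggplot2 building blocks. The sketch below shows the general shape of each one, assuming the tidyverse is loaded. It uses the mpg dataset that ships with ggplot2 as a stand-in, so none of these plots answers an exercise, and the object names are invented for illustration. Titles and axis labels are left out for brevity; your own plots still need them.

```r
library(tidyverse)

# Illustrative only: mpg (bundled with ggplot2) stands in for
# diamonds_subset, so none of these plots answers an exercise.

# Scatterplot colored by a categorical variable, with one smoothed
# fit per group and no confidence bands (se = FALSE).
smooth_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

# One scatterplot panel per level of a categorical variable.
facet_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# Segmented bar chart: position = "fill" rescales every bar to span
# 0 to 1, so the segments show proportions rather than counts.
fill_plot <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

# Side-by-side boxplots and the matching violin plot.
box_plot <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
violin_plot <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_violin()
```

Storing each plot in a named object and then printing it (for example, by typing smooth_plot on its own line) keeps the chunk readable and makes it easy to add labels later with labs().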

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit, mark the pages where your answer to each exercise appears. If an answer spans multiple pages, mark all of them. Associate the “Overall” section with the first page.