Lab 03: Data wrangling

Due: Thu, Feb 11 at 11:59pm ET

Goals

Clone assignment repo and start new project

General guidelines

This is an individual assignment. Your assignment should have at least three meaningful commits and all code chunks should have informative names. Don’t forget to keep all text and code within 80 characters per line.

Lego analysis

We will examine a dataset containing characteristics of lego sets manufactured between 1961 and 2019 from the BRICKSET website. Variables in the dataset are described below.

Variable       Description
id             set id
name           name of set
themegroup     themegroup of set
theme          theme of set
subtheme       subtheme of set
year           year released
pieces         number of pieces
minifigs       number of minifigs
package        type of packaging
retail_price   recommended retail price in dollars

Load tidyverse with

library(tidyverse)

Read in the data and save it as an object named lego with

lego <- read_csv("data/lego.csv")

  1. Some sets have missing information for retail_price, pieces, or both. This could be because the sets are free (giveaways), because they aren’t traditional lego sets (comic books, etc.), or simply because the information is missing. Filter the lego dataset based on the specifications below and save the result as lego using <-, which overwrites the original lego object. In addition, describe the implications of removing these sets.

Your new lego tibble (data frame) should have:

  2. Arrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive lego sets, their prices, and how many pieces each has.

  3. It appears that the most expensive sets generally have more pieces. Use mutate() to create a new variable, price_per_piece, representing the price in dollars per piece for each set. Save the result as lego, overwriting the current lego object.

  4. Arrange the lego dataset in descending order of price_per_piece and return only the first five rows and the columns name, themegroup, theme, pieces, and price_per_piece. What do you notice about these sets?

  5. Return a tibble containing the cheapest and most expensive lego sets (based on retail_price) in each subtheme, considering only sets with the Lord of the Rings theme.

  6. Use group_by() and summarize() to create a new tibble with one row for each year and columns for the year, the number of sets released in that year, and the median price per piece for sets from that year. Save the resulting tibble as an object named yearly_trends.

  7. Create a plot of the median price per piece over time using the yearly_trends tibble. Size the points according to the number of sets produced in that year. Adjust transparency, color, etc. as appropriate, and remember the principles of effective data visualization. Comment on what you observe.
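
The wrangling verbs used in the exercises above can be sketched on a small made-up tibble. This is only an illustration under stated assumptions, not a solution: the object names toy_sets, toy_complete, top_two, toy_extremes, toy_yearly, and toy_plot, and the column median_ppp, are hypothetical, and the values are invented rather than taken from data/lego.csv.

```r
library(tidyverse)

# Made-up toy data (NOT the lab's lego.csv); every value is illustrative.
toy_sets <- tribble(
  ~name,     ~theme,  ~year, ~pieces, ~retail_price,
  "Alpha",   "City",   2018,     100,            10,
  "Bravo",   "City",   2018,     400,            20,
  "Charlie", "Space",  2019,      50,            15,
  "Delta",   "Space",  2019,      NA,             5,
  "Echo",    "Space",  2019,     200,            NA
)

# filter(): keep only rows where both columns are observed.
toy_complete <- toy_sets %>%
  filter(!is.na(pieces), !is.na(retail_price))

# mutate(): add a derived column and overwrite the object.
toy_complete <- toy_complete %>%
  mutate(price_per_piece = retail_price / pieces)

# arrange() + select() + slice(): sort in descending order,
# keep a few columns, and take the first two rows.
top_two <- toy_complete %>%
  arrange(desc(price_per_piece)) %>%
  select(name, pieces, price_per_piece) %>%
  slice(1:2)

# Grouped filter(): keep the cheapest and most expensive row per theme.
toy_extremes <- toy_complete %>%
  group_by(theme) %>%
  filter(retail_price %in% range(retail_price)) %>%
  ungroup()

# group_by() + summarize(): one row per year with a count and a median.
toy_yearly <- toy_complete %>%
  group_by(year) %>%
  summarize(
    n_sets = n(),
    median_ppp = median(price_per_piece)
  )

# ggplot(): map the count to point size; alpha softens overlapping points.
toy_plot <- ggplot(
  toy_yearly,
  aes(x = year, y = median_ppp, size = n_sets)
) +
  geom_point(alpha = 0.6) +
  labs(
    x = "Year",
    y = "Median price per piece (dollars)",
    size = "Number of sets"
  )
```

Printing top_two, toy_extremes, toy_yearly, or toy_plot in the console shows each result. Note that a theme with a single set appears only once in toy_extremes, since its cheapest and most expensive set are the same row.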

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark the pages where your answer to each exercise appears. If an answer spans multiple pages, mark all of them. Associate the “Overall” section with the first page.