Goals

Use data wrangling to extract meaning from data
Practice using seven helpful dplyr verbs (functions):
- filter(): pick rows matching criteria
- select(): pick columns by name
- mutate(): add new variables
- slice(): pick rows using indices
- arrange(): reorder rows
- group_by(): for grouped operations
- summarize(): calculate summary statistics
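
As a rough illustration of how these verbs chain together, here is a minimal sketch on a small made-up tibble. The object `toy` and its columns (`item`, `group`, `cost`, `parts`) are hypothetical and are not part of the lab's data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  item  = c("a", "b", "c", "d"),
  group = c("x", "x", "y", "y"),
  cost  = c(10, 4, 8, 2),
  parts = c(5, 2, 4, 1)
)

toy_summary <- toy %>%
  filter(cost > 3) %>%                      # pick rows matching criteria
  select(item, group, cost, parts) %>%      # pick columns by name
  mutate(cost_per_part = cost / parts) %>%  # add a new variable
  arrange(desc(cost)) %>%                   # reorder rows, largest cost first
  group_by(group) %>%                       # set up a grouped operation
  summarize(mean_cost = mean(cost), n_items = n())  # one row per group

# slice() picks rows by position; here, the first two rows of toy
first_two <- toy %>%
  slice(1:2)
```

Each step receives the result of the previous one through the pipe, so the order of the verbs matters: the grouped means here are computed only from rows that survived the filter.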

Clone assignment repo and start new project

Accept and create your private repository of the assignment at https://classroom.github.com/a/j6tXRehp
Clone the repository and open a new project in RStudio. See the first lab and recent lectures for a reminder of the steps.

General guidelines

This is an individual assignment. Your assignment should have at least three meaningful commits and all code chunks should have informative names. Don’t forget to keep all text and code within 80 characters per line.

Lego analysis

We will examine a dataset containing characteristics of lego sets manufactured between 1961 and 2019 from the BRICKSET website. Variables in the dataset are described below.

Variable	Description
`id`	set id
`name`	name of set
`themegroup`	themegroup of set
`theme`	theme of set
`subtheme`	subtheme of set
`year`	year released
`pieces`	number of pieces
`minifigs`	number of minifigs
`package`	type of packaging
`retail_price`	recommended retail price in dollars

Load tidyverse with

library(tidyverse)

Read in the data and save it as an object named lego with

lego <- read_csv("data/lego.csv")

Some sets have missing information for retail_price, pieces, or both. This could be because the sets are free (giveaways), because they aren’t traditional lego sets (comic books, etc.), or simply because the information was not recorded. Filter the lego dataset based on the specifications below and save the result as lego using <-, which overwrites the original lego object. In addition, describe the implications of removing these sets.

Your new lego tibble (data frame) should contain only sets with:

- no missing value for pieces
- a nonzero number of pieces
- no missing value for retail_price
- a nonzero retail_price
- no missing value for year
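
The general pattern for this kind of filtering can be sketched on made-up data. The tibble `toy` and its columns below are hypothetical; the sketch only shows how `is.na()` and several conditions combine inside `filter()`, and it is not a solution to the exercise.

```r
library(dplyr)

# Hypothetical toy data with missing and zero values
toy <- tibble(
  item  = c("a", "b", "c", "d"),
  cost  = c(10, NA, 0, 3),
  parts = c(5, 2, 4, NA)
)

# Conditions separated by commas are combined with "and".
# filter() also drops rows where a condition evaluates to NA,
# but writing !is.na() makes the intent explicit.
toy_complete <- toy %>%
  filter(!is.na(cost), cost > 0, !is.na(parts))
```

Only item "a" remains here: "b" has a missing cost, "c" has a cost of zero, and "d" has a missing number of parts.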

Arrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive lego sets, their prices, and how many pieces each has.
It appears that the most expensive sets generally have more pieces. Use mutate() to create a new variable price_per_piece, representing the price in dollars per piece for each of the sets. Save the result as lego. Hence, you will overwrite the current lego object.
Arrange the lego dataset in descending order of price_per_piece and return only the first five rows and the columns name, themegroup, theme, pieces, and price_per_piece. What do you notice about these sets?
Return a tibble containing the cheapest and most expensive lego sets (based on retail_price) in each subtheme, considering only sets with the Lord of the Rings theme.
Use group_by() and summarize() to create a new tibble with one row for each year, and columns for the year, the number of sets released in that year, and the median price per piece for sets from that year. Save this resulting tibble as an object named yearly_trends.
Create a plot of the median price per piece over time using the yearly_trends tibble. Size points according to the number of sets produced in that year. Adjust transparency, color, etc. as appropriate, and remember the principles of effective data visualization. Comment on what you observe.
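
For the grouped summary and the sized-point plot, a minimal sketch on made-up data may help. The tibble `toy`, its columns `season` and `score`, and the object names `by_season` and `p` are all hypothetical; adapt the idea to the lab's variables rather than copying it.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  season = c(1, 1, 1, 2, 2),
  score  = c(2, 4, 9, 5, 7)
)

# One row per season: number of observations and median score
by_season <- toy %>%
  group_by(season) %>%
  summarize(n_obs = n(), median_score = median(score))

# Map the group size to point size; alpha sets transparency
p <- ggplot(by_season, aes(x = season, y = median_score, size = n_obs)) +
  geom_point(alpha = 0.6) +
  labs(x = "Season", y = "Median score", size = "Observations")
```

Printing `p` draws the plot. Mapping `size` inside `aes()` ties point size to the data, whereas setting `alpha` outside `aes()` applies one fixed transparency to every point.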

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark the pages where the answer to each exercise appears. If an answer spans multiple pages, mark all of them. Associate the “Overall” section with the first page.

Lab 03: Data wrangling

Due: Thu, Feb 11 at 11:59pm ET
