Use data wrangling to extract meaning from data
Practice using the seven helpful verbs (functions)
filter(): pick rows matching criteriaselect(): pick columns by namemutate(): add new variablesslice(): pick rows using indicesarrange(): reorder rowsgroup_by(): for grouped operationssummarize(): calculate summary statisticsAccept and create your private repository of the assignment at https://classroom.github.com/a/j6tXRehp
Clone the repository and open a new project in RStudio. See the first lab and recent lectures for a reminder of the steps.
This is an individual assignment. Your assignment should have at least three meaningful commits and all code chunks should have informative names. Don’t forget to keep all text and code within 80 characters per line.
We will examine a dataset containing characteristics of lego sets manufactured between 1961 and 2019 from the BRICKSET website. Variables in the dataset are described below.
| Variable | Description |
|---|---|
id |
set id |
name |
name of set |
themegroup |
themegroup of set |
theme |
theme of set |
subtheme |
subtheme of set |
year |
year released |
pieces |
number of pieces |
minifigs |
number of minifigs |
package |
type of packaging |
retail_price |
recommended retail price in dollars |
Load tidyverse with
Read in the data and save it as an object named lego with
retail_price or pieces or both. This could be because the sets are free (giveaways), they aren’t traditional lego sets (comic books, etc) or just because the information is missing. Filter the lego dataset based on the specifications below and save the result as lego using <-. Hence, you will overwrite the original lego object. In addition, describe the implications of removing these sets.Your new lego tibble (data frame) should have:
piecespiecesretail_priceretail_priceyearArrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive lego sets, their prices, and how many pieces each has.
It appears that the most expensive sets generally have more pieces. Use mutate() to create a new variable price_per_piece, representing the price in dollars per piece for each of the sets. Save the result as lego. Hence, you will overwrite the current lego object.
Arrange the lego dataset in descending order of price_per_piece and return only the columns name, themegroup, theme, pieces, price_per_piece, and the first five rows. What do you notice about these sets?
Return a tibble containing the cheapest and most expensive lego sets (based on retail_price) in each subtheme, considering only sets with the Lord of the Rings theme.
Use group_by() and summarize() to create a new tibble with one row for each year, and columns for the year, the number of sets released in that year, and the median price per piece for sets from that year. Save this resulting tibble as an object named yearly_trends.
Create a plot of the median price per piece over time using the yearly_trends tibble. Size points according to the number of sets produced in that year. Adjust transparency, color, etc as appropriate and remember the principles of effective data visualization. Comment on what you observe.
Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.