Homework 02: Data wrangling and joins

Due: Friday, February 12 11:59pm ET

Goals

General guidelines

These are large datasets so do not print out the full tibbles. Doing so will cause problems when you knit to PDF.

For every join function you should explicitly specify the by argument

Clone assignment repo and start new project

Pitching analysis

We will work with the tidyverse package as usual.

library(tidyverse)

In this assignment you will work with four related datasets to answer research questions about baseball. The datasets are available in the data folder and are adapted from the Lahman package in R. This package contains a huge amount of information about batting, pitching, fielding, teams, etc. We will only examine a small portion, but I encourage you to explore on your own when you have finished the assignment.

No knowledge of baseball is required to complete this homework other than the definitions provided below. If you want more information check out the link here: https://www.youtube.com/watch?v=skOsApsF0jQ.

The pitcher throws the baseball from the pitcher’s mound towards the catcher, and the batter tries to hit it. If the pitch is successful and the batter misses (or doesn’t swing) it is called a strike. Three times is a strikeout.

An earned run occurs when a player advances around all of the bases without errors by the defense. The earned run average (era) is the average of earned runs given up by a pitcher per nine innings pitched (the standard length of a game of baseball). Low values indicate a good pitcher.

\[\text{earned run average} = 9 \times \dfrac{\text{earned runs allowed}}{\text{innings pitched}}\]

A brief description of the datasets and how they are related to each other is provided below.

The pitching dataset contains information on Major League Baseball pitchers. Observations are uniquely identified by player_id, year_id, and stint.

The people dataset contains player characteristics. Observations are uniquely identified by player_id.

The salaries dataset contains salaries of players. Observations are uniquely identified by year_id, team_id, and player_id.

The teams dataset contains yearly statistics and standings for teams. Observations are uniquely identified by year_id and team_id.

Win percentage and spending

  1. Use the teams data to find each team’s win percentage for the years 2011 through 2016. Save the result as a tibble named team_stats. This tibble should have three columns (year_id, team_id, and win_pct) and 180 rows (6 years \(\times\) 30 teams = 180 rows).

  2. Use an appropriate join function to add salary information to players in the pitching data. You should only include observations that appear in both pitching and salaries. Then, using a group_by() paired with a summarize(), create a new column giving the total amount of money (sum()) spent on pitching salaries by each team in each year. Save the result as a tibble named team_spending. This dataset should have three columns (year_id, team_id, and pitching_salaries) and 918 rows.

  3. Use an appropriate function to join team_stats and team_spending and use faceting to create subplots of win percentage versus pitching spending for each year.

Top pitchers

Some pitchers play for more than one team in a single year. For example, Aroldis Chapman (chapmar01) played with both the World Series winning 2016 Chicago Cubs and the New York Yankees in 2016. During his first stint with the Yankees he had 7 earned runs and 31.1 innings pitched, and in his second stint with the Cubs he had 3 earned runs and 26.2 innings pitched.

  1. Create a new dataset containing the yearly earned run average for each player, summarizing across all teams played within each year. Name your dataset player_era. You should have columns player_id, year_id, total_strike_outs, and era, where total_strike_outs refers to the total number of strikeouts for that year and era refers to the earned run average for that year. Chapman should have a single row for 2016 and his 2016 earned run average (era) should be 1.57.

\[9 * \dfrac{7+3}{31.1 + 26.2} \approx 1.57\]

  1. Join the people data to the player_era data and mutate() a new variable with each player’s approximate age. Call this dataset player_era_age. Then, create a new dataset called pitching_stats_by_age with the median earned run average, median number of strike outs, and count of players for all possible ages using group_by() and summarize().

  2. Construct a plot of median strike outs versus age with points sized by how many pitchers are that age. Describe what you observe.

Choosing the top pitchers of all time is a contentious topic, but a good argument can be made for the following players: Sandy Koufax, Clayton Kershaw, Bob Gibson, Roger Clemens, Greg Maddux, Cy Young, Walter Johnson, Randy Johnson, and Christy Mathewson. The player_id for each of these players is provided below.

top_players <- c("koufasa01", "kershcl01", "gibsobo01", 
                 "clemero02", "maddugr01", "youngcy01", 
                 "johnswa01", "johnsra05", "mathech01")
  1. Filter the player_era_age data so that it only includes the top pitchers above and use faceting to create subplots of era versus age for each player. Label the subplots by “Last Name, First Name” instead of player_id (the paste() function will be helpful here). Briefly describe what you observe.

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.