Goals

Develop proficiency with data-wrangling functions
Answer real research questions using related datasets

General guidelines

For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.
For your first commit, update your author name in the YAML header of the template R Markdown file.
All plots should follow the best visualization practices discussed in lecture, including an informative title, labeled axes, and careful consideration of aesthetic choices.
All code should follow the tidyverse style guidelines, including not exceeding the 80 character limit.

These are large datasets so do not print out the full tibbles. Doing so will cause problems when you knit to PDF.

For every join function you should explicitly specify the by argument

Clone assignment repo and start new project

Accept and create your private repository of the assignment at https://classroom.github.com/a/jpue2_A1
Clone the repository and open a new project in RStudio.

Pitching analysis

We will work with the tidyverse package as usual.

library(tidyverse)

In this assignment you will work with four related datasets to answer research questions about baseball. The datasets are available in the data folder and are adapted from the Lahman package in R. This package contains a huge amount of information about batting, pitching, fielding, teams, etc. We will only examine a small portion, but I encourage you to explore on your own when you have finished the assignment.

No knowledge of baseball is required to complete this homework other than the definitions provided below. If you want more information check out the link here: https://www.youtube.com/watch?v=skOsApsF0jQ.

The pitcher throws the baseball from the pitcher’s mound towards the catcher, and the batter tries to hit it. If the pitch is successful and the batter misses (or doesn’t swing) it is called a strike. Three times is a strikeout.

An earned run occurs when a player advances around all of the bases without errors by the defense. The earned run average (era) is the average of earned runs given up by a pitcher per nine innings pitched (the standard length of a game of baseball). Low values indicate a good pitcher.

\[\text{earned run average} = 9 \times \dfrac{\text{earned runs allowed}}{\text{innings pitched}}\]

A brief description of the datasets and how they are related to each other is provided below.

The pitching dataset contains information on Major League Baseball pitchers. Observations are uniquely identified by player_id, year_id, and stint.

player_id: player ID code
year_id: year
stint: player’s order of appearances within a season on different teams
team_id: team
league_id: league
earned_runs: earned runs
strike_outs: strikeouts
innings_pitched: innings pitched

The people dataset contains player characteristics. Observations are uniquely identified by player_id.

player_id: player ID code
birth_year: birth year of player
first_name: player first name
last_name: player last name

The salaries dataset contains salaries of players. Observations are uniquely identified by year_id, team_id, and player_id.

year_id: year
team_id: team
player_id: player ID code
league_id: league
salary: salary

The teams dataset contains yearly statistics and standings for teams. Observations are uniquely identified by year_id and team_id.

year_id: year
team_id: team
games: games played
ws_win: World Series winner (Y or N)
wins: wins
team_name: team name

Win percentage and spending

Use the teams data to find each team’s win percentage for the years 2011 through 2016. Save the result as a tibble named team_stats. This tibble should have three columns (year_id, team_id, and win_pct) and 180 rows (6 years \(\times\) 30 teams = 180 rows).
Use an appropriate join function to add salary information to players in the pitching data. You should only include observations that appear in both pitching and salaries. Then, using a group_by() paired with a summarize(), create a new column giving the total amount of money (sum()) spent on pitching salaries by each team in each year. Save the result as a tibble named team_spending. This dataset should have three columns (year_id, team_id, and pitching_salaries) and 918 rows.
Use an appropriate function to join team_stats and team_spending and use faceting to create subplots of win percentage versus pitching spending for each year.

Historical strikeout trends

Find the pitcher with the most strikeouts (strike_outs in the pitching dataset) in each year, by league, from 1901 to the most recent year in the dataset. Save the result as a tibble named top_so_by_year and glimpse() the dataset. Your final dataset should have 243 rows (there are some ties) and the columns year_id, league_id,first_name, and last_name, where the name information is from the people dataset.

Top pitchers

Some pitchers play for more than one team in a single year. For example, Aroldis Chapman (chapmar01) played with both the World Series winning 2016 Chicago Cubs and the New York Yankees in 2016. During his first stint with the Yankees he had 7 earned runs and 31.1 innings pitched, and in his second stint with the Cubs he had 3 earned runs and 26.2 innings pitched.

Create a new dataset containing the yearly earned run average for each player, summarizing across all teams played within each year. Name your dataset player_era. You should have columns player_id, year_id, total_strike_outs, and era, where total_strike_outs refers to the total number of strikeouts for that year and era refers to the earned run average for that year. Chapman should have a single row for 2016 and his 2016 earned run average (era) should be 1.57.

\[9 * \dfrac{7+3}{31.1 + 26.2} \approx 1.57\]

Join the people data to the player_era data and mutate() a new variable with each player’s approximate age. Call this dataset player_era_age. Then, create a new dataset called pitching_stats_by_age with the median earned run average, median number of strike outs, and count of players for all possible ages using group_by() and summarize().
Construct a plot of median strike outs versus age with points sized by how many pitchers are that age. Describe what you observe.

Choosing the top pitchers of all time is a contentious topic, but a good argument can be made for the following players: Sandy Koufax, Clayton Kershaw, Bob Gibson, Roger Clemens, Greg Maddux, Cy Young, Walter Johnson, Randy Johnson, and Christy Mathewson. The player_id for each of these players is provided below.

top_players <- c("koufasa01", "kershcl01", "gibsobo01", 
                 "clemero02", "maddugr01", "youngcy01", 
                 "johnswa01", "johnsra05", "mathech01")

Filter the player_era_age data so that it only includes the top pitchers above and use faceting to create subplots of era versus age for each player. Label the subplots by “Last Name, First Name” instead of player_id (the paste() function will be helpful here). Briefly describe what you observe.

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.

Homework 02: Data wrangling and joins

Due: Friday, February 12 11:59pm ET