To illustrate how bar plots, segmented bar plots, and mosaic plots can be created using MLB PITCHf/x data, I’ve provided four examples below, looking at all pitches thrown by Clayton Kershaw on Monday. First, let’s load the data and our required packages, then filter for pitches thrown by Clayton Kershaw. The table shows that he throws four pitch types and rarely throws the changeup (CH).

# Load Data
load(url("http://stat.duke.edu/courses/Summer17/sta101.001-2/uploads/project/mondayBaseball.Rdata"))

# Load Packages
library(ggplot2)
library(dplyr)

# Look at pitches for Clayton Kershaw
kershaw <- mondayBaseball %>%
  filter(pitcherName == "Clayton Kershaw") %>%
  droplevels() # Removing pitch types that Clayton Kershaw does not throw

table(kershaw$pitchType)
## 
##  CH  CU  FF  SL 
##   6  62 201 122
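
To put these counts on a relative scale, here is a minimal sketch that converts them to proportions. The counts are retyped by hand from the output above so the snippet runs on its own, and the names pitch_counts and pitch_props are just illustrative; with the data loaded, prop.table(table(kershaw$pitchType)) should give the same numbers.

```r
# Pitch counts copied by hand from the table above (stand-alone illustrative vector)
pitch_counts <- c(CH = 6, CU = 62, FF = 201, SL = 122)

# Convert counts to proportions of all pitches
pitch_props <- pitch_counts / sum(pitch_counts)
round(pitch_props, 3)
```

The changeup works out to roughly 1.5% of his pitches, while the four-seam fastball (FF) makes up a little over half.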

Frequency Bar Plots

Often, we’re interested in visualizing the distribution of a categorical variable. We can do this with a simple bar plot. Below, we make a bar plot that summarizes the distribution of pitch types thrown by Clayton Kershaw.

ggplot(kershaw, aes(x = pitchType, fill = pitchType)) +
  geom_bar() +
  scale_fill_discrete(name = "Pitch Type") + # Legend title
  xlab("Pitch Type") + # X axis title
  ylab("Frequency") # Y axis title

Customizing the Bar Plot

We’re not always after frequencies when creating bar plots. Sometimes, we may want to compare the mean of a numerical variable across multiple categories, taking into account the uncertainty in each estimate. The graph below visualizes the mean spin rate of each of Clayton Kershaw’s pitch types, with standard error bars to capture that uncertainty.

kershaw_spin <- kershaw %>%
  filter(!(is.na(spinRate))) %>% # remove missing values
  group_by(pitchType) %>%
  summarise(mean_spin = mean(spinRate),
            se_spin = sd(spinRate) / sqrt(n())) # mean and standard error of spin rate for each pitch type

ggplot(kershaw_spin, aes(x = pitchType, y = mean_spin, fill = pitchType)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_spin - se_spin, ymax = mean_spin + se_spin), width = 0.2) +
  xlab("Pitch Type") +
  ylab("Mean Spin Rate") +
  scale_fill_discrete("Pitch Type") +
  coord_cartesian(ylim = c(2000, 2500)) # only look at y values between 2000 and 2500
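
The standard error used for the error bars is the sample standard deviation divided by the square root of the number of observations. Here is a small sketch of that calculation on its own. The spin rates below are invented purely for illustration and are not taken from the data.

```r
# Hypothetical spin rates for a single pitch type (invented values)
spin <- c(2300, 2350, 2400, 2250, 2450)

mean_spin <- mean(spin)                   # sample mean
se_spin <- sd(spin) / sqrt(length(spin))  # standard error of the mean

c(mean = mean_spin, se = se_spin)
```

Each error bar in the plot then runs from one standard error below the mean to one standard error above it.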

Mosaic Plots

Creating a mosaic plot with the ggplot2 package in R is difficult. To help us out, let’s load the ggmosaic package.

library(ggmosaic)

Let’s visualize the distribution of Clayton Kershaw’s pitch types by count. (If there are two balls and two strikes on the batter before the pitch is thrown, the count is 2-2.) First, let’s make the count variable and order its levels to make the plot more informative.

kershaw <- kershaw %>%
  mutate(count = factor(paste0(balls, "-", strikes), levels = c("0-0", "1-0", "2-0", "3-0", "3-1",
                                                                "2-1", "1-1", "2-2", "3-2", "0-1", "1-2", "0-2")))

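To see what the explicit levels do, here is a minimal sketch on a few made-up pitches. The balls and strikes values are invented for illustration, and only four levels are listed to keep it short.

```r
# Hypothetical balls and strikes for four pitches (invented values)
balls   <- c(0, 1, 0, 3)
strikes <- c(0, 0, 2, 2)

# Build the count label and fix the display order with explicit levels
count <- factor(paste0(balls, "-", strikes),
                levels = c("0-0", "1-0", "3-2", "0-2"))

levels(count)      # order used on the plot axis
as.integer(count)  # position of each pitch's count within that order
```

The categories are drawn in the order given by levels, not alphabetically. Any count that is not listed in levels becomes NA, so the list needs to cover every count that occurs in the data.
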
The following code generates the mosaic plot.

ggplot(kershaw) +
  geom_mosaic(aes(x = product(pitchType, count), fill = pitchType)) +
  xlab("Count") +
  ylab("Proportion") +
  scale_fill_discrete("Pitch Type")

Segmented Bar Plots

Another way to visualize the relationship between two categorical variables is with a segmented bar plot. The following code shows the relationship between count and pitch type, with each bar split according to the proportion of each pitch type thrown in that count.

ggplot(kershaw) +
  geom_bar(aes(x = count, fill = pitchType), position = "fill") + # one bar per count, segments scaled to proportions
  labs(x = "Count", y = "Proportion", fill = "Pitch Type") # axis and legend titles
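
The proportions that a segmented bar plot of two categorical variables is meant to convey can also be tabulated directly. Here is a minimal sketch with base R's table() and prop.table(), using a handful of invented pitches rather than the real data.

```r
# Hypothetical counts and pitch types for six pitches (invented values)
count     <- c("0-0", "0-0", "0-0", "0-2", "0-2", "0-2")
pitchType <- c("FF",  "FF",  "SL",  "CU",  "SL",  "SL")

# Cross-tabulate, then convert each row (one row per count) to proportions
tab <- table(count, pitchType)
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

Each row sums to 1, so a row gives the mix of pitch types within one count. With the data loaded, the same idea should apply to table(kershaw$count, kershaw$pitchType).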