To illustrate how bar plots, segmented bar plots, and mosaic plots can be created using MLB PITCHf/x data, I’ve provided four examples below, looking at all pitches thrown by Clayton Kershaw on Monday. First, let’s load the data and our required packages, then filter for pitches thrown by Clayton Kershaw. We see that he throws four pitch types and rarely throws the changeup.
# Load Data
load(url("http://stat.duke.edu/courses/Summer17/sta101.001-2/uploads/project/mondayBaseball.Rdata"))
# Load Packages
library(ggplot2)
library(dplyr)
# Look at pitches for Clayton Kershaw
kershaw <- mondayBaseball %>%
  filter(pitcherName == "Clayton Kershaw") %>%
  droplevels() # Removing pitch types that Clayton Kershaw does not throw
table(kershaw$pitchType)
##
## CH CU FF SL
## 6 62 201 122
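To put a number on “rarely,” we can convert the counts above to proportions. This is a minimal sketch that retypes the counts from the printed table instead of reloading the data; the names `pitch_counts` and `pitch_props` are my own and are not part of the original code.

```r
# Counts copied by hand from the table() output above
pitch_counts <- c(CH = 6, CU = 62, FF = 201, SL = 122)

# Share of all pitches made up by each pitch type, rounded to 3 decimals
pitch_props <- round(prop.table(pitch_counts), 3)
print(pitch_props)
```

With the full data loaded, `prop.table(table(kershaw$pitchType))` should give the same proportions.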
Often, we’re interested in visualizing the distribution of a categorical variable. We can do this with a simple bar plot. Below, we make a bar plot that summarizes the distribution of pitch types thrown by Clayton Kershaw.
ggplot(kershaw, aes(x = pitchType, fill = pitchType)) +
  geom_bar() +
  scale_fill_discrete(name = "Pitch Type") + # Legend title
  xlab("Pitch Type") + # X axis title
  ylab("Frequency") # Y axis title
We’re not always after frequencies when creating bar plots. Sometimes we want to compare the mean of a numerical variable across categories while accounting for the uncertainty in each estimate. The graph below shows the mean spin rate of each of Clayton Kershaw’s pitches, with standard error bars to capture that uncertainty.
kershaw_spin <- kershaw %>%
  filter(!is.na(spinRate)) %>% # remove missing values
  group_by(pitchType) %>%
  summarise(mean_spin = mean(spinRate),
            se_spin = sd(spinRate) / sqrt(n())) # mean and standard error of spin rate for each pitch type
ggplot(kershaw_spin, aes(x = pitchType, y = mean_spin, fill = pitchType)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_spin - se_spin, ymax = mean_spin + se_spin), width = 0.2) +
  xlab("Pitch Type") +
  ylab("Mean Spin Rate") +
  scale_fill_discrete("Pitch Type") +
  coord_cartesian(ylim = c(2000, 2500)) # only look at y values between 2000 and 2500
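The standard error behind those error bars is the sample standard deviation divided by the square root of the number of pitches. Here is a small sketch of that calculation on made-up numbers; the values in `spin` are hypothetical and are not taken from the PITCHf/x data.

```r
# Hypothetical spin rates for one pitch type (illustration only)
spin <- c(2300, 2350, 2280, 2410, 2320)

# Mean and standard error of the mean
mean_spin <- mean(spin)
se_spin <- sd(spin) / sqrt(length(spin))

# Ends of the error bar: one standard error below and above the mean
lower <- mean_spin - se_spin
upper <- mean_spin + se_spin
print(c(lower = lower, mean = mean_spin, upper = upper))
```

Because the standard error shrinks as the number of pitches grows, a rarely thrown pitch type will tend to get a wider error bar than a frequently thrown one with similar spread.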
Creating a mosaic plot with the ggplot2 package in R is difficult. To help us out, let’s load the ggmosaic package.
library(ggmosaic)
Let’s try to visualize the distribution of Clayton Kershaw’s pitch types by count. (If there are two balls and two strikes on the batter before the pitch is thrown, the count is 2-2.) First, let’s make the count variable and order its levels to make the plot more informative.
kershaw <- kershaw %>%
  mutate(count = factor(paste0(balls, "-", strikes),
                        levels = c("0-0", "1-0", "2-0", "3-0", "3-1", "2-1",
                                   "1-1", "2-2", "3-2", "0-1", "1-2", "0-2")))
And the following code generates the mosaic plot.
ggplot(kershaw) +
  geom_mosaic(aes(x = product(pitchType, count), fill = pitchType)) +
  xlab("Count") +
  ylab("Proportion") +
  scale_fill_discrete("Pitch Type")
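A mosaic plot is a picture of a two-way table. Assuming the default layout, the width of each column reflects how often a count occurs, and the tile heights reflect the pitch-type proportions within that count. Here is a rough sketch of that table on a handful of made-up pitches; the `toy` data frame is hypothetical and is not drawn from `kershaw`.

```r
# Made-up pitches: balls, strikes, and pitch type (not real PITCHf/x rows)
toy <- data.frame(
  balls     = c(0, 0, 0, 0, 0, 0, 0),
  strikes   = c(0, 0, 0, 2, 2, 2, 2),
  pitchType = c("FF", "FF", "SL", "SL", "SL", "CU", "FF")
)

# Build the count factor, listing only the levels the toy data needs
toy$count <- factor(paste0(toy$balls, "-", toy$strikes),
                    levels = c("0-0", "0-2"))

# Any value missing from `levels` silently becomes NA, so check first
stopifnot(!anyNA(toy$count))

# Rows are counts, columns are pitch types
tab <- table(toy$count, toy$pitchType)

# Proportion of each pitch type within each count (each row sums to 1)
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 2))
```

The same two calls on the real data, `table(kershaw$count, kershaw$pitchType)` followed by `prop.table(..., margin = 1)`, give the numbers behind the plot.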
Another way to visualize the relationship between two categorical variables is with a segmented bar plot. The following code will show the relationship between count and pitch type. I apologize for the poor formatting of this graph; I’m trying to get this to you guys quickly.
ggplot(kershaw) +
  geom_bar(aes(x = count, fill = pitchType)) + # one bar per count, segmented by pitch type
  xlab("Count") +
  ylab("Frequency")
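If the counts occur at very different rates, a relative-frequency version of the segmented bar plot can be easier to compare across bars. Here is a minimal sketch using `position = "fill"` in `geom_bar()`, which rescales every bar to the same height. The `toy` data frame is made up for illustration; swapping in `kershaw` should work the same way, assuming it has the `count` and `pitchType` columns created earlier.

```r
library(ggplot2)

# Made-up pitches (illustration only, not real PITCHf/x rows)
toy <- data.frame(
  count     = factor(c("0-0", "0-0", "0-0", "0-2", "0-2", "0-2", "0-2")),
  pitchType = factor(c("FF", "FF", "SL", "SL", "SL", "CU", "FF"))
)

# position = "fill" stacks the segments and rescales each bar to sum to 1
p <- ggplot(toy) +
  geom_bar(aes(x = count, fill = pitchType), position = "fill") +
  xlab("Count") +
  ylab("Proportion") +
  scale_fill_discrete("Pitch Type")

print(p)
```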