Packages

library(tidyverse)

Data

MLB

teams <- read_csv("http://www2.stat.duke.edu/~sms185/data/mlb/teams.csv")

Object teams is a data frame that contains yearly statistics and standings for MLB teams from 2009 to 2018.

The data has 300 rows and 56 variables.

Energy

energy <- read_csv("http://www2.stat.duke.edu/~sms185/data/energy/energy.csv")

Each row represents a power source and the amount of energy it generates per day, measured in megawatt hours (MWh).

  • MWhperDay: MWh of energy generated per day
  • name: energy source name
  • type: type of energy source
  • location: country of energy source
  • note: more details on energy source
  • boe: barrel of oil equivalent


  • Daily megawatt hour (MWh) is a measure of energy output.
  • 1 MWh is, on average, enough power for 28 people in the USA.
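The rule of thumb in the last bullet can be applied directly to a column of daily MWh values. The following is a minimal sketch on a small made-up data frame: the name toy_energy, its values, and the people_supplied column are illustrative assumptions and are not taken from energy.csv.

```r
# Hypothetical toy data; the values are invented for illustration only
toy_energy <- data.frame(
  name      = c("Source A", "Source B", "Source C"),
  MWhperDay = c(1000, 25000, 500),
  stringsAsFactors = FALSE
)

# Rough average quoted in the text: 1 MWh is enough power for about 28 people
people_per_mwh <- 28

# Approximate number of people each source could supply, under that rule of thumb
toy_energy$people_supplied <- toy_energy$MWhperDay * people_per_mwh

toy_energy
```

The same multiplication should work on energy$MWhperDay once the real data is loaded, though the result is only as precise as the 28-person average.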

Flint water

flint <- read_csv("http://www2.stat.duke.edu/~sms185/data/health/flint.csv")

Each row represents a home in Flint, Michigan. Lead contaminant values in the water were recorded at three times, represented by draw1, draw2, and draw3.

Exercise 1

Problem

Use tibble teams to re-create the plot below.

Solution

ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 3, alpha = .8) +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner")

Exercise 2

Problem

Try to improve the visualization in Exercise 1 by drawing attention to the division winners and their relationship between runs and strikeouts.

Solution

ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 2, alpha = .8) +
  geom_hline(yintercept = 750, lty = 2, alpha = .5, color = "blue") +
  geom_vline(xintercept = 1250, lty = 2, alpha = .5, color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner",
       title = "Division winners generally score more runs",
       subtitle = "and have fewer strike outs") +
  scale_color_manual(values = c("grey", "red")) +
  scale_x_continuous(limits = c(750, 1750), breaks = seq(900, 1700, 350),
                     labels = seq(900, 1700, 350)) +
  scale_y_continuous(limits = c(500, 1000), breaks = seq(500, 1000, 100),
                     labels = seq(500, 1000, 100)) +
  theme_bw(base_size = 16) +
  theme(legend.position = "bottom")
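One detail of this solution that is easy to check by hand is which tick marks the two seq() calls produce. The short sketch below evaluates them on their own; the names x_breaks and y_breaks are introduced here for illustration only.

```r
# Breaks used on the x axis (strikeouts): start at 900, step by 350, stop at or before 1700
x_breaks <- seq(900, 1700, 350)
x_breaks
# → 900 1250 1600

# Breaks used on the y axis (runs): 500 to 1000 in steps of 100
y_breaks <- seq(500, 1000, 100)
y_breaks
# → 500 600 700 800 900 1000

# The vertical reference line at 1250 coincides with one of the x-axis breaks
1250 %in% x_breaks
# → TRUE
```

Because the vertical dashed line sits at 1250, it lines up with the middle x-axis tick in every facet.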

Exercise 3

Problem

Re-create the plot below using energy.

A few notes:

  • base font size is 18

  • hex colors: c("#9d8b7e", "#315a70", "#66344c", "#678b93", "#b5cfe1", "#ffcccc")

  • use function order() to help get the top 30

Starter code:

energy_top_30 <- energy[order(energy$MWhperDay, decreasing = TRUE)[1:30], ]
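To see what the starter code is doing, here is a small sketch of the same order() idiom on a made-up data frame, keeping the top 3 rows rather than the top 30. The names toy_energy, idx, and toy_top_3 and all values are hypothetical.

```r
# Hypothetical toy data; the values are invented for illustration only
toy_energy <- data.frame(
  name      = c("A", "B", "C", "D", "E"),
  MWhperDay = c(120, 950, 40, 610, 300),
  stringsAsFactors = FALSE
)

# order() returns the row indices that would sort MWhperDay;
# decreasing = TRUE puts the largest values first
idx <- order(toy_energy$MWhperDay, decreasing = TRUE)

# Keep only the first 3 indices, then use them to subset the rows
toy_top_3 <- toy_energy[idx[1:3], ]

toy_top_3$name
# → "B" "D" "E"
```

Note that order() returns positions, not the sorted values, which is why its result is used inside the row index of the data frame.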

Solution

ggplot(energy_top_30, mapping = aes(x = reorder(name, MWhperDay), 
                                    y = MWhperDay / 1000, 
                                    fill = type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#9d8b7e", "#315a70", "#66344c",
                               "#678b93", "#b5cfe1", "#ffcccc")) +
  theme_bw(base_size = 18) +
  labs(y       = "Daily MWh (in thousands)", x = "Power Source",
       title   = "Top 30 power source energy generators",
       fill    = "Power Source",
       caption = "1 MWh is, on average, enough power for 28 people in the USA") +
  coord_flip()

Exercise 4

Problem

Re-create the plot below using flint.

Some details to help you replicate the plot:

  • remove zip codes 48529 and 48502
  • color choices: "red", "#256d7b"
  • font size: 16
  • theme_bw()

Solution

flint %>% 
  filter(!(zip %in% c(48529, 48502))) %>% 
  ggplot(mapping = aes(x = reorder(factor(zip), draw1, quantile, .75), y = draw1)) +
  geom_boxplot(fill = "#256d7b", alpha = 0.7) +
  scale_y_continuous(breaks = seq(0, 165, 15), labels = seq(0, 165, 15)) +
  coord_flip() +
  geom_hline(yintercept = 15, color = "red", alpha = 0.7, 
             linetype = 2, size = 1.25) +
  annotate("text", y = 45, x = .75, label = "EPA action level, 15 ppb",
           color = "red", alpha = 0.7, size = 6) +
  labs(x = "Zip code", y = "Lead content (ppb)",
       caption = "Action level for lead is when 15 ppb is in more than 10% of customer taps sampled",
       title = "First draw of lead samples in Flint, MI homes") +
  theme_bw(base_size = 16)
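The x aesthetic in this solution orders the zip codes by the 75th percentile of draw1 rather than by the default mean. The following is a minimal sketch of how reorder() does that, on invented data: toy_flint, its group labels, and its values are hypothetical and are not taken from flint.csv.

```r
# Hypothetical toy data: three groups with three lead readings each
toy_flint <- data.frame(
  zip   = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  draw1 = c(1, 2, 30, 10, 12, 14, 0, 1, 2),
  stringsAsFactors = FALSE
)

# reorder(x, X, FUN, ...) sorts the levels of x by FUN applied to X
# within each level; the trailing .75 is passed on to quantile() as probs
by_q75 <- reorder(factor(toy_flint$zip), toy_flint$draw1, quantile, .75)
levels(by_q75)
# → "C" "B" "A"

# For comparison, the default summary function is mean()
by_mean <- reorder(factor(toy_flint$zip), toy_flint$draw1)
levels(by_mean)
# → "C" "A" "B"
```

In this toy example the single large reading in group A lifts its 75th percentile above group B's even though B has the larger mean, so the two orderings differ. In the plot, ordering by the upper quartile sorts the boxes by the position of their upper hinge.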