ggplot2



  • ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.

  • It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

Why ggplot2?

  • Statistical: visualize conditional relationships (\(Y \sim X | Z\)) with ease
  • Technical: human-readable syntax
  • Bonus: aesthetically pleasing\(^*\)

New data - movies

The data set consists of 651 randomly sampled movies produced and released before 2016. The data come from IMDB and Rotten Tomatoes. A codebook is available here.

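library(dplyr)    # provides %>% and as_tibble()
library(ggplot2)  # needed for the plots that follow

movies = read.csv("http://www.stat.duke.edu/~cr173/Sta523_Fa17/data/movies/movies.csv", 
                  stringsAsFactors = FALSE) %>% as_tibble()  # tbl_df() is deprecated
movies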
## # A tibble: 651 x 32
##                     title   title_type       genre runtime mpaa_rating                   studio
##                     <chr>        <chr>       <chr>   <int>       <chr>                    <chr>
##  1            Filly Brown Feature Film       Drama      80           R      Indomina Media Inc.
##  2               The Dish Feature Film       Drama     101       PG-13    Warner Bros. Pictures
##  3    Waiting for Guffman Feature Film      Comedy      84           R   Sony Pictures Classics
##  4   The Age of Innocence Feature Film       Drama     139          PG        Columbia Pictures
##  5            Malevolence Feature Film      Horror      90           R Anchor Bay Entertainment
##  6            Old Partner  Documentary Documentary      78     Unrated       Shcalo Media Group
##  7              Lady Jane Feature Film       Drama     142       PG-13     Paramount Home Video
##  8           Mad Dog Time Feature Film       Drama      93           R       MGM/United Artists
##  9 Beauty Is Embarrassing  Documentary Documentary      88     Unrated     Independent Pictures
## 10   The Snowtown Murders Feature Film       Drama     119     Unrated                IFC Films
## # ... with 641 more rows, and 26 more variables: thtr_rel_year <int>, thtr_rel_month <int>,
## #   thtr_rel_day <int>, dvd_rel_year <int>, dvd_rel_month <int>, dvd_rel_day <int>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <chr>, critics_score <int>,
## #   audience_rating <chr>, audience_score <int>, best_pic_nom <chr>, best_pic_win <chr>,
## #   best_actor_win <chr>, best_actress_win <chr>, best_dir_win <chr>, top200_box <chr>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>

A simple task

Let's create a visualization with the following features:

  • Scatter plot of audience score (y-axis) against critics score (x-axis)

  • Color points based on title_type (i.e. are they a Documentary, Feature Film, or TV Movie).

  • Include a least squares line for each title type.

With base R

plot(y = movies$audience_score, x = movies$critics_score, 
     col = adjustcolor(as.integer(factor(movies$title_type)),alpha=0.5), pch=16)

doc = movies[movies$title_type == "Documentary", ]
ff = movies[movies$title_type == "Feature Film", ]
tv = movies[movies$title_type == "TV Movie", ]

m_doc = lm(audience_score ~ critics_score, data = doc)
m_ff =  lm(audience_score ~ critics_score, data = ff)
m_tv =  lm(audience_score ~ critics_score, data = tv)

abline(m_doc, col = 1, lwd=2)
abline(m_ff, col = 2, lwd=2)
abline(m_tv, col = 3, lwd=2)

legend("topleft", levels(factor(movies$title_type)), 
       col = c(1,2,3), lty = 1)

With ggplot2

ggplot(data = movies, aes(x = critics_score, y = audience_score, color = title_type)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "lm", se = FALSE, fullrange=TRUE)

Grammar of Graphics

The Grammar of Graphics

  • Visualisation concept created by Leland Wilkinson (1999)
    • to define the basic elements of a statistical graphic
  • Adapted for R by Wickham (2009) who created the ggplot2 package
    • consistent and compact syntax to describe statistical graphics
    • highly modular as it breaks up graphs into semantic components
  • It is not meant as a guide to which graph to use and how to best convey your data (more on that later).

The Grammar of Graphics - Terminology

A statistical graphic is a…

  • mapping of data
  • to aesthetic attributes (color, size, xy-position, etc.)
  • using geometric objects (points, lines, bars, etc.)
  • with data being statistically transformed (summarised, log-transformed, etc.)
  • and mapped onto a specific facet and coordinate system
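As a rough illustration of the terminology above, the sketch below labels each grammar component in one plot. It uses the mpg data set that ships with ggplot2 as a stand-in for movies (an assumption made so the example runs without a download); the variable choices are illustrative only.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the movies data here
p <- ggplot(data = mpg,                                      # data
            aes(x = displ, y = hwy, color = drv)) +          # aesthetic mappings
  geom_point(alpha = 0.5) +                                  # geometric object
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistical transformation
  facet_wrap(~ year) +                                       # facets
  coord_cartesian(xlim = c(1, 7))                            # coordinate system

p
```

Each component is added with +, so any one of them can be swapped out without touching the others.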

Audience score vs. critics score

  • Which data is used as an input?
  • What geometric objects are chosen for visualization?
  • What variables are mapped onto which attributes?
  • What type of scales are used to map data to aesthetics?
  • Are the variables statistically transformed before plotting?

Audience score vs. critics score - code

ggplot(data = movies, aes(x = audience_score, y = critics_score)) +
  geom_point()

Altering features

  • How did the plot change?
  • Are these changes based on data (i.e. can be mapped to variables in the dataset) or are the changes based on stylistic choices for the geometric objects?

Altering features - code

ggplot(data = movies, aes(x = audience_score, y = critics_score)) +
  geom_point(alpha = 0.5, color = "blue")
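The distinction between data-driven and stylistic changes shows up in where an argument is placed. A minimal sketch, using the bundled mpg data as a stand-in for movies (an assumption for self-containment):

```r
library(ggplot2)

# Mapped: color depends on a variable, so it goes inside aes() and gets a legend
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5)

# Set: color is a fixed stylistic choice, so it is an argument to the geom
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5, color = "blue")

# Common slip: a constant inside aes() is treated as a one-level variable,
# so the points get the default palette color rather than blue
p_slip <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point(alpha = 0.5)
```

layer_data(p_set)$colour and layer_data(p_slip)$colour can be compared to see which colors were actually used.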

Faceting

  • How did the plot change?
  • Are these changes based on data (i.e. can be mapped to variables in the dataset) or are the changes based on stylistic choices for the geometric objects?

Faceting - code

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = genre)) +
  geom_point(alpha = 0.5) +
  facet_grid(. ~ title_type)

More faceting

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = genre)) +
  geom_point(alpha = 0.5) +
  facet_grid(audience_rating ~ title_type)

Even more faceting

ggplot(data = movies, aes(x = audience_score, y = critics_score, color = title_type)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~genre)
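The two faceting functions used above differ in how they lay out panels. The following sketch contrasts them on the bundled mpg data (a stand-in for movies, chosen so the code runs as is):

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5)

# facet_grid: rows ~ columns, one panel for each combination of the two variables
p_grid <- base + facet_grid(drv ~ cyl)

# facet_wrap: panels for a single variable, wrapped into as many rows as needed
p_wrap <- base + facet_wrap(~ class, ncol = 3)
```

facet_grid tends to suit two variables with few levels each, while facet_wrap is usually more compact for one variable with many levels, such as genre.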

Anatomy of a ggplot

ggplot(
  data = [dataframe], 
  aes(
    x = [var_x], y = [var_y], 
    color = [var_for_color], 
    fill = [var_for_fill], 
    shape = [var_for_shape]
  )
) +
  geom_[some_geom]([geom_arguments]) +
  ... # other geometries
  scale_[some_axis]_[some_scale]() +
  facet_[some_facet]([formula]) +
  ... # other options
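One possible way to fill in the template, shown on the bundled mpg data rather than movies (an assumption for self-containment). The specific scale limits and theme tweak are illustrative choices, not part of the template.

```r
library(ggplot2)

p <- ggplot(
  data = mpg,
  aes(x = class, y = hwy, fill = drv)              # [var_x], [var_y], [var_for_fill]
) +
  geom_boxplot() +                                 # geom_[some_geom]
  scale_y_continuous(limits = c(0, 50)) +          # scale_[some_axis]_[some_scale]
  facet_wrap(~ year) +                             # facet_[some_facet]([formula])
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # other options

p
```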

Various plots

Histograms

ggplot(data = movies, aes(x = audience_score)) +
  geom_histogram(binwidth = 5)

Boxplots

ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot()

Boxplots - axis formatting

ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Density plots - border color

ggplot(data = movies, aes(x = runtime, color = audience_rating)) +
  geom_density() 

Density plots - fill color

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density() 

Density plots - fill color, with alpha

ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density(alpha = 0.5) 

Scatter plots

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) 

Smoothing

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth()

Smoothing - lm

ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
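To check what geom_smooth(method = "lm") draws, the fitted values can be pulled out of the plot with layer_data() and compared with an explicit lm() fit. This is a sketch on the bundled mpg data (a stand-in for movies); formula = y ~ x is spelled out only to silence the message about the default formula.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Fitted line as computed by ggplot2 (layer 2 is the smoother)
smooth_df <- layer_data(p, 2)

# The same line from an explicit least squares fit
fit <- lm(hwy ~ displ, data = mpg)
by_hand <- predict(fit, newdata = data.frame(displ = smooth_df$x))

# Compare the two sets of fitted values
all.equal(smooth_df$y, unname(by_hand))
```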

Barplots

ggplot(data = movies, aes(x = genre)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Segmented barplots

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Segmented barplots - proportions

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "fill") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
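With position = "fill", each bar is rescaled to a height of 1 so the segments show within-category proportions. The sketch below computes those proportions by hand for comparison, using the bundled mpg data in place of movies (an assumption for self-containment):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

# Counts per class (rows) and drv (columns), rescaled so each row sums to 1
counts <- table(mpg$class, mpg$drv)
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Each row of props corresponds to one bar, and each column to one fill segment within that bar.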

Dodged barplots

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))

Not on team theme_grey()?

ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "dodge") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

More ggplot2 resources

Exercise 1

Recreate the following plot. Hint: Add a labs() layer.
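For reference, a generic sketch of how labs() sets titles and axis or legend labels. It uses the bundled mpg data and made-up label text, so it is not the solution to the exercise.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Highway mileage vs. engine displacement",  # plot title
    x = "Engine displacement (litres)",                 # x-axis label
    y = "Highway miles per gallon",                     # y-axis label
    color = "Drive train"                               # legend title for the color aesthetic
  )

p
```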

Exercise 2

Recreate the following plot. Hint: the black lines are a linear model (lm) fit to all of the movies within a rating category and the grey line is the 0-1 line (intercept 0, slope 1).

Acknowledgments

Above materials are derived in part from the following sources: