class: center, middle, inverse, title-slide

# ggplot2
### Colin Rundel
### 2019-02-12

---
exclude: true

---

## ggplot2

.pull-left[
<img src="imgs/hex-ggplot2.png" width="250px" style="display: block; margin: auto;" />
]

.pull-right[.midi[
- ggplot2 is a plotting system for R, based on the grammar of graphics
  * using the good parts of base and lattice

- It takes care of many of the fiddly details that make plotting a hassle
  * like drawing legends and faceting
  * particularly helpful for plotting multivariate data
]]

---

## Movies data

The data set consists of 651 randomly sampled movies produced and released before 2016. Data come from [IMDB](http://www.imdb.com/) and [Rotten Tomatoes](http://www.rottentomatoes.com/). The codebook is available [here](data/movies/movies.html).

.small[
```r
movies = readr::read_csv("http://bit.ly/sta323_movie_data")
movies
```

```
## # A tibble: 651 x 32
##    title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
##    <chr> <chr>      <chr>   <dbl> <chr>       <chr>          <dbl>          <dbl>        <dbl>        <dbl>
##  1 Fill… Feature F… Drama      80 R           Indom…          2013              4           19         2013
##  2 The … Feature F… Drama     101 PG-13       Warne…          2001              3           14         2001
##  3 Wait… Feature F… Come…      84 R           Sony …          1996              8           21         2001
##  4 The … Feature F… Drama     139 PG          Colum…          1993             10            1         2001
##  5 Male… Feature F… Horr…      90 R           Ancho…          2004              9           10         2005
##  6 Old … Documenta… Docu…      78 Unrated     Shcal…          2009              1           15         2010
##  7 Lady… Feature F… Drama     142 PG-13       Param…          1986              1            1         2003
##  8 Mad … Feature F… Drama      93 R           MGM/U…          1996             11            8         2004
##  9 Beau… Documenta… Docu…      88 Unrated     Indep…          2012              9            7         2013
## 10 The … Feature F… Drama     119 Unrated     IFC F…          2012              3            2         2012
## # … with 641 more rows, and 22 more variables: dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <dbl>, critics_rating <chr>, critics_score <dbl>, audience_rating <chr>,
## #   audience_score <dbl>, best_pic_nom <chr>, best_pic_win <chr>, best_actor_win <chr>,
## #   best_actress_win <chr>, best_dir_win <chr>,
top200_box <chr>, director <chr>, actor1 <chr>, actor2 <chr>,
## #   actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
```
]

---

## A simple task

Let's create a visualization with the following features:

* Scatter plot of critics score vs audience score

* Color points based on `title_type` (i.e. are they a Documentary, Feature Film, or TV Movie).

* Include a least squares line for each title type.

---

## With base R

```r
# Scatter plot, colored by the integer code of each title type
plot(y = movies$audience_score, x = movies$critics_score,
     col = adjustcolor(as.integer(factor(movies$title_type)),alpha=0.5),
     pch = 16)

# Subset the data by title type
doc = movies[movies$title_type == "Documentary", ]
ff = movies[movies$title_type == "Feature Film", ]
tv = movies[movies$title_type == "TV Movie", ]

# Fit a separate least squares line for each title type
m_doc = lm(audience_score ~ critics_score, data = doc)
m_ff = lm(audience_score ~ critics_score, data = ff)
m_tv = lm(audience_score ~ critics_score, data = tv)

# Add the fitted lines and a legend, matching the point colors
abline(m_doc, col = 1, lwd=2)
abline(m_ff, col = 2, lwd=2)
abline(m_tv, col = 3, lwd=2)

legend("topleft", levels(factor(movies$title_type)), col = c(1,2,3), lty = 1)
```

---

## With base R (output)

<img src="Lec09_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## With ggplot2

```r
library(ggplot2)

ggplot(data = movies, aes(x = critics_score, y = audience_score, color = title_type)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "lm", se = FALSE, fullrange=TRUE)
```

<img src="Lec09_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## The Grammar of Graphics

- Visualisation concept created by Leland Wilkinson (1999)
  - to define the basic elements of a statistical graphic

- Adapted for R by Wickham (2009)
  - consistent and compact syntax to describe statistical graphics
  - highly modular as it breaks up graphs into semantic components

- It is not meant as a guide to which graph to use and how to best convey your data (more on that later).

---

## Terminology

A statistical graphic is a...
- mapping of **data**
- which may be **statistically transformed** (summarised, log-transformed, etc.)
- to **aesthetic attributes** (color, size, xy-position, etc.)
- using **geometric objects** (points, lines, bars, etc.)
- and mapped onto a specific **facet** and **coordinate system**

---

## Audience score vs. critics score

<img src="Lec09_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

- Which data is used as an input?
- Are the variables statistically transformed before plotting?
- What geometric objects are used to represent the data?
- What variables are mapped onto which aesthetic attributes?
- What type of scales are used to map data to aesthetics?

---

## Code

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score)) +
  geom_point()
```

<img src="Lec09_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Altering features

<img src="Lec09_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

- How did the plot change?
- Are these changes based on data or are the changes based on stylistic choices for the geometric objects?

---

## Code

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score)) +
  geom_point(alpha = 0.5, color = "blue")
```

<img src="Lec09_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Faceting (Small multiples)

<img src="Lec09_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

- How did the plot change?
- Are these changes based on data (i.e. can be mapped to variables in the dataset) or are the changes based on stylistic choices for the geometric objects?
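
---

## Mapped vs. set aesthetics

The "based on data or stylistic choice" question comes down to whether an aesthetic sits inside or outside `aes()`. The sketch below is one way to see the difference. It uses the built-in `mtcars` data rather than `movies` so that it runs without a download, and the names `p_mapped` and `p_set` are made up for illustration.

```r
library(ggplot2)

# Mapped aesthetic: color is driven by a variable, so it goes inside aes()
# and ggplot2 builds a scale (and a legend) for it.
p_mapped = ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(alpha = 0.5)

# Set aesthetic: color is a fixed stylistic choice, so it is passed to the
# geom outside aes(); no scale or legend is created for it.
p_set = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5, color = "blue")

# Inspect the colors actually used by the first layer of each plot
unique(layer_data(p_mapped, 1)$colour)  # one color per level of factor(cyl)
unique(layer_data(p_set, 1)$colour)     # a single fixed color
```

Printing `p_mapped` and `p_set` side by side should show the same points, with a legend appearing only for the mapped version.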
---

## Faceting - code

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = genre)) +
  geom_point(alpha = 0.5) +
  facet_grid(~ title_type)
```

<img src="Lec09_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Alternative Faceting

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = genre)) +
  geom_point(alpha = 0.5) +
  facet_grid(title_type ~ .)
```

<img src="Lec09_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## More faceting

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = genre)) +
  geom_point(alpha = 0.5) +
  facet_grid(audience_rating ~ title_type)
```

<img src="Lec09_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Even More faceting

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = title_type)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~genre)
```

<img src="Lec09_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Anatomy of a ggplot

```r
ggplot(
  data = [dataframe],
  aes(
    x = [var_x], y = [var_y],
    color = [var_for_color],
    fill = [var_for_fill],
    shape = [var_for_shape]
  )
) +
  geom_[some_geom]([geom_arguments]) +
  ... # other geometries
  scale_[some_axis]_[some_scale]() +
  facet_[some_facet]([formula]) +
  ...
  # other options
```

---
class: middle
count: false

# Various plots

---

## Histograms

```r
ggplot(data = movies, aes(x = audience_score)) +
  geom_histogram(binwidth = 5)
```

<img src="Lec09_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Boxplots

```r
ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot()
```

<img src="Lec09_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Boxplots - axis formatting

```r
ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Boxplots - flipping axes

```r
ggplot(data = movies, aes(y = audience_score, x = genre)) +
  geom_boxplot() +
  coord_flip()
```

<img src="Lec09_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Density plots

```r
ggplot(data = movies, aes(x = runtime, color = audience_rating)) +
  geom_density()
```

<img src="Lec09_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Density plots - fill vs color

```r
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density()
```

<img src="Lec09_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

## Density plots - fill vs color + alpha

```r
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density(alpha = 0.5)
```

<img src="Lec09_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

## Scatter plots

```r
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5)
```

<img src="Lec09_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

## Smoothing

```r
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
```

```
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```

<img src="Lec09_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Smoothing - lm

```r
ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
```

<img src="Lec09_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

## Barplots

```r
ggplot(data = movies, aes(x = genre)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Segmented barplots

```r
ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar() +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

## Segmented barplots - proportions

```r
ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "fill") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

## Dodged barplots

```r
ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x=element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

## Don't like `theme_grey()`?
```r
ggplot(data = movies, aes(x = genre, fill = audience_rating)) +
  geom_bar(position = "dodge") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="Lec09_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---
class: middle
count: false

# Scales

---

## Scales

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_sqrt()
```

<img src="Lec09_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

## Scales - Color (Viridis)

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point() +
  scale_color_viridis_d()
```

<img src="Lec09_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

## Scales - [Color Brewer](http://colorbrewer2.org/)

```r
ggplot(data = movies, aes(x = audience_score, y = critics_score, color = mpaa_rating)) +
  geom_point() +
  scale_color_brewer(palette = "Accent")
```

<img src="Lec09_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

---

## Scales again

```r
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density(alpha = 0.5) +
  scale_x_log10()
```

<img src="Lec09_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

---

## Scales again

```r
ggplot(data = movies, aes(x = runtime, fill = audience_rating)) +
  geom_density() +
  scale_x_log10() +
  scale_fill_manual(values=c("#4B9CD3","#001A57"))
```

<img src="Lec09_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

---

## More `ggplot2` resources

- Visit http://docs.ggplot2.org/current/ for documentation on the current version of the `ggplot2` package. Lots of examples for every geometry and layer type.
- Refer to the `ggplot2` cheatsheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf
- Themes vignette: http://docs.ggplot2.org/dev/vignettes/themes.html

---

## Exercise 1

Recreate the following plot. Hint: Add a `labs()` layer.

<img src="Lec09_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />

---

## Exercise 2

Recreate the following plot. Hint: the black lines are a linear model (lm) fit to *all* of the movies within a rating category and the grey line is the 0-1 line (intercept 0, slope 1).

```
## Warning: Removed 23 rows containing missing values (geom_smooth).
```

<img src="Lec09_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />

---
class: middle
count: false

# Acknowledgments

---

## Acknowledgments

Above materials are derived in part from the following sources:

* Mine Cetinkaya-Rundel's [DataFest 2016 Visualization Workshop](https://github.com/mine-cetinkaya-rundel/df2016_workshops/tree/master/viz_ggplot2_shiny)
* [RStudio Data Visualization Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)
* Package Documentation - [ggplot2](http://docs.ggplot2.org/current/)
* Tim Winkle's [ggplot2 Workshop](https://rpubs.com/timwinke/ggplot2workshop)