class: center, middle, inverse, title-slide

# Package ggplot2
## Statistical Computing & Programming

### Shawn Santo

### 05-25-20

---

## Supplementary materials

Companion videos

- [Introduction to `ggplot2`](https://warpwire.duke.edu/w/n8oDAA/)
- [Scatterplots in `ggplot2`](https://warpwire.duke.edu/w/ocoDAA/)
- [Using other `geom` functions](https://warpwire.duke.edu/w/o8oDAA/)

Additional resources

- [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html), R for Data Science
- `ggplot2` [Reference](https://ggplot2.tidyverse.org/reference/index.html)

---

## `ggplot2`

- `ggplot2` is a plotting system for R, based on the grammar of graphics
    - using the good parts of base and lattice

- It takes care of many of the fiddly details that make plotting a hassle
    - such as drawing legends and faceting
    - particularly helpful for plotting multivariate data

Package `ggplot2` is available in package `tidyverse`. Let's load that now.

```r
library(tidyverse)
```

---

## The Grammar of Graphics

- Visualization concept created by Leland Wilkinson (1999)
    - to define the basic elements of a statistical graphic

- Adapted for R by Wickham (2009)
    - consistent and compact syntax to describe statistical graphics
    - highly modular as it breaks up graphs into semantic components

- It is not meant as a guide to which graph to use and how to best convey your data (more on that later).

---

## Today's data: MLB

```r
teams <- read_csv("http://www2.stat.duke.edu/~sms185/data/mlb/teams.csv")
```

Object `teams` is a data frame that contains yearly statistics and standings for MLB teams from 2009 to 2018.

The data has 300 rows and 56 variables.

---

## A quick aside on tibbles

Object `teams` is a data frame with additional class components.

```r
class(teams)
```

```
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

<br>

- Tibbles have nicer output when printing.
- Tibbles do not convert strings to factors automatically.
- Tibbles show each vector's type.
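---

## Tibbles: a small check

The points above can be tried on a tiny example. This is only a sketch: the data frame `df` and its columns `team` and `wins` are made up for illustration and are not part of the MLB data.

```r
library(tibble)

# Hypothetical toy data (not the `teams` data from these slides)
df  <- data.frame(team = c("ARI", "ATL"), wins = c(70, 86))
tbl <- as_tibble(df)

# A tibble keeps "data.frame" as its last class,
# so functions written for data frames still accept it
class(tbl)
is.data.frame(tbl)

# Strings stay character vectors
# (since R 4.0.0, data.frame() also does this by default)
class(tbl$team)

# Printing shows the dimensions and each column's type
tbl
```

Because a tibble is still a data frame, it can be passed directly as the `data` argument of `ggplot()`.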
---

.tiny[

```r
teams
```

```
#> # A tibble: 300 x 56
#>    yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin
#>     <dbl> <chr> <chr>  <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#>  1   2009 NL    ARI    ARI      W         5   162    81    70    92 N
#>  2   2009 NL    ATL    ATL      E         3   162    81    86    76 N
#>  3   2009 AL    BAL    BAL      E         5   162    81    64    98 N
#>  4   2009 AL    BOS    BOS      E         2   162    81    95    67 N
#>  5   2009 AL    CHA    CHW      C         3   162    81    79    83 N
#>  6   2009 NL    CHN    CHC      C         2   161    80    83    78 N
#>  7   2009 NL    CIN    CIN      C         4   162    81    78    84 N
#>  8   2009 AL    CLE    CLE      C         4   162    81    65    97 N
#>  9   2009 NL    COL    COL      W         2   162    81    92    70 N
#> 10   2009 AL    DET    DET      C         2   163    81    86    77 N
#> # … with 290 more rows, and 45 more variables: WCWin <chr>, LgWin <chr>,
#> #   WSWin <chr>, R <dbl>, AB <dbl>, H <dbl>, X2B <dbl>, X3B <dbl>,
#> #   HR <dbl>, BB <dbl>, SO <dbl>, SB <dbl>, CS <dbl>, HBP <dbl>, SF <dbl>,
#> #   RA <dbl>, ER <dbl>, ERA <dbl>, CG <dbl>, SHO <dbl>, SV <dbl>,
#> #   IPouts <dbl>, HA <dbl>, HRA <dbl>, BBA <dbl>, SOA <dbl>, E <dbl>,
#> #   DP <dbl>, FP <dbl>, name <chr>, park <chr>, attendance <dbl>,
#> #   BPF <dbl>, PPF <dbl>, teamIDBR <chr>, teamIDlahman45 <chr>,
#> #   teamIDretro <chr>, TB <dbl>, WinPct <dbl>, rpg <dbl>, hrpg <dbl>,
#> #   tbpg <dbl>, kpg <dbl>, k2bb <dbl>, whip <dbl>
```

]

---

class: inverse, center, middle

# Plot comparison

---

## Using `ggplot()`

<img src="lec-08_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Using `plot()`

<img src="lec-08_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Code comparison

Using `ggplot()`

```r
# x is the run differential (R - RA); y is the win percentage
ggplot(teams, mapping = aes(x = R - RA, y = WinPct, color = DivWin)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Run Differential", y = "Win Percentage")
```

--

Using `plot()`

```r
# Compute the run differential and split by division-winner status
teams$RD <- teams$R - teams$RA
teams_div <- teams[teams$DivWin == "Y", ]
teams_no_div <- teams[teams$DivWin == "N", ]

# Fit one regression line per group
mod1 <- lm(WinPct ~ RD, data = teams_div)
mod2 <- lm(WinPct ~ RD, data = teams_no_div)

# Color codes: 1 = "N", 2 = "Y" (factor levels in alphabetical order)
plot(x = (teams$R - teams$RA), y = teams$WinPct,
     col = adjustcolor(as.integer(factor(teams$DivWin))),
     pch = 16, xlab = "Run Differential", ylab = "Win Percentage")
abline(mod1, col = 2, lwd = 2)
abline(mod2, col = 1, lwd = 2)
```

---

class: inverse, center, middle

# What's in a `ggplot()`?

---

## Terminology

A statistical graphic is a...

- mapping of **data**
- which may be **statistically transformed** (summarized, log-transformed, etc.)
- to **aesthetic attributes** (color, size, xy-position, etc.)
- using **geometric objects** (points, lines, bars, etc.)
- and mapped onto a specific **facet** and **coordinate system.**

---

## What do I "need"?

1) Some data (preferably in a data frame)

```r
*ggplot(data = teams)
```

--

2) A set of variable mappings

```r
*ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W))
```

--

3) A geom with arguments, or multiple geoms with arguments connected by `+`

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
* geom_point(color = "blue")
```

--

4) Some options on changing scales or adding facets

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
* facet_wrap(~yearID, nrow = 2)
```

---

5) Some labels

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
* labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands")
```

--

6) Other options

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands") +
* theme_bw(base_size = 20) +
* theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

---

Some data (preferably in a data frame)

<img src="lec-08_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

A set of variable mappings

<img src="lec-08_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;"
/>

---

A geom with arguments, or multiple geoms with arguments connected by `+`

<img src="lec-08_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

Some options on changing scales or adding facets

<img src="lec-08_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

Some labels

<img src="lec-08_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

Other options

<img src="lec-08_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

## Anatomy of a ggplot

```r
ggplot(
  data = [dataframe],
  aes(
    x = [var_x], y = [var_y],
    color = [var_for_color],
    fill = [var_for_fill],
    shape = [var_for_shape],
    size = [var_for_size],
    alpha = [var_for_alpha],
    ...#other aesthetics
  )
) +
  geom_<some_geom>([geom_arguments]) +
  ... # other geoms
  scale_<some_axis>_<some_scale>() +
  facet_<some_facet>([formula]) +
  ... # other options
```

To visualize multivariate relationships we can add variables to our visualization by specifying aesthetics: color, size, shape, linetype, alpha, or fill; we can also add facets based on variable levels.
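---

## Anatomy of a ggplot: a filled-in sketch

The template above can be filled in as follows. This is a minimal sketch on made-up data: the data frame `toy` and its columns (`runs`, `wins`, `league`, `year`) are hypothetical and only stand in for the placeholders in the template.

```r
library(ggplot2)

# Hypothetical toy data; column names are illustrative only
toy <- data.frame(
  runs   = c(650, 700, 750, 800, 720, 680),
  wins   = c(70, 78, 88, 95, 81, 75),
  league = c("AL", "AL", "AL", "NL", "NL", "NL"),
  year   = c(2017, 2018, 2017, 2018, 2017, 2018)
)

p <- ggplot(
  data = toy,
  # color is mapped to a variable, so it goes inside aes()
  mapping = aes(x = runs, y = wins, color = league)
) +
  # size and alpha are fixed values, so they go outside aes()
  geom_point(size = 3, alpha = .8) +
  # a scale_<axis>_<scale>() call adjusts how x is displayed
  scale_x_continuous(limits = c(600, 850)) +
  # one panel per year
  facet_wrap(~year) +
  labs(x = "Runs", y = "Wins", color = "League") +
  theme_bw()

p
```

Each `+` adds one component, so any line after the first `ggplot()` call can be dropped and the plot still renders.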
---

class: inverse, center, middle

# Scatter plots

---

## Base plot

.tiny[

```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct)) +
* geom_point()
```

<img src="lec-08_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[

```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct)) +
* geom_point(color = "#E81828")
```

<img src="lec-08_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[

```r
*ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct, color = lgID)) +
  geom_point(show.legend = FALSE)
```

<img src="lec-08_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[

```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct, color = lgID)) +
* geom_point()
```

<img src="lec-08_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## Base plot

.tiny[

```r
ggplot(data = teams[teams$yearID == 2018, ], mapping = aes(x = BB + H, y = SO)) +
  geom_point()
```

<img src="lec-08_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[

```r
ggplot(data = teams[teams$yearID == 2018, ], mapping = aes(x = BB + H, y = SO)) +
* geom_point(size = 3, shape = 2, color = "#E81828")
```

<img src="lec-08_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[

```r
ggplot(data = teams[teams$yearID == 2018, ],
       mapping = aes(x = BB + H, y = SO,
*                    color = factor(Rank), shape = factor(Rank))) +
  geom_point(size = 4, alpha = .8, show.legend = FALSE)
```

<img src="lec-08_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[

```r
ggplot(data = teams[teams$yearID == 2018, ],
       mapping = aes(x = BB + H, y = SO,
*                    color = factor(Rank), shape = factor(Rank))) +
  geom_point(size = 4, alpha = .8)
```

<img src="lec-08_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
]

---

## Inside or outside `aes()`?

When does an aesthetic go inside function `aes()`?

- If you want an aesthetic to reflect a variable's values, it must go inside `aes()`.

- If you want to set an aesthetic manually and not have it convey information about a variable, use the aesthetic's name outside of `aes()` and set it to your desired value.

Aesthetics for continuous and discrete variables are measured on continuous and discrete scales, respectively.

---

## Faceting

.tiny[

```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(lgID ~ .)
```

<img src="lec-08_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[

```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(. ~ lgID)
```

<img src="lec-08_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[

```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(divID ~ lgID)
```

<img src="lec-08_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[

```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_wrap(~yearID)
```

<img src="lec-08_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
]

---

## Facet grid or wrap?

- Use `facet_wrap()` to wrap a one-dimensional sequence of panels into two dimensions.

- Use `facet_grid()` when you have two discrete variables and you want panels of plots to represent all possible combinations.
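---

## Facet grid or wrap: a small comparison

The two rules above can be compared side by side on the same base plot. This is a sketch on made-up data: the data frame `toy` and its columns (`league`, `division`, `team`, `runs`, `wins`) are hypothetical, with values chosen only so every league and division combination has points.

```r
library(ggplot2)

# Hypothetical toy data: 2 leagues x 3 divisions, two rows per combination
toy <- expand.grid(
  league   = c("AL", "NL"),
  division = c("C", "E", "W"),
  team     = 1:2
)
toy$runs <- seq(600, by = 20, length.out = nrow(toy))
toy$wins <- seq(60, by = 3, length.out = nrow(toy))

base <- ggplot(toy, mapping = aes(x = runs, y = wins)) +
  geom_point()

# One variable: panels are laid out in sequence and wrapped into 2 rows
p_wrap <- base + facet_wrap(~division, nrow = 2)

# Two variables: rows are divisions, columns are leagues,
# giving one panel for every combination
p_grid <- base + facet_grid(division ~ league)

p_wrap
p_grid
```

With three divisions, `p_wrap` has three panels; `p_grid` has six, one for each division and league pair.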
---

## Exercise

Use tibble `teams` to re-create the plot below.

<img src="lec-08_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

<br/>

**How can we improve this visualization?**

???

.tiny[

```r
ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 3, alpha = .8) +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner")
```

]

---

## A more effective visualization

<img src="lec-08_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />

???

```r
ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 2, alpha = .8) +
  geom_hline(yintercept = 750, lty = 2, alpha = .5, color = "blue") +
  geom_vline(xintercept = 1250, lty = 2, alpha = .5, color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner",
       title = "Division winners generally score more runs",
       subtitle = "and have fewer strike outs") +
  scale_color_manual(values = c("grey", "red")) +
  scale_x_continuous(limits = c(750, 1750), breaks = seq(900, 1700, 350),
                     labels = seq(900, 1700, 350)) +
  scale_y_continuous(limits = c(500, 1000), breaks = seq(500, 1000, 100),
                     labels = seq(500, 1000, 100)) +
  theme_bw(base_size = 16) +
  theme(legend.position = "bottom")
```

---

class: inverse, center, middle

# Other geoms

---

## Caution

- The following plots are not well-polished. They are designed to demonstrate the various geoms and options that exist within `ggplot2`.

- You should always have a well-labelled and polished visualization if it will be seen by an outside audience.
---

## Box plots

.tiny[

```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) +
* geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7)
```

<img src="lec-08_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]

---

## Box plots: flipped coordinates

.tiny[

```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) +
  geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7) +
* coord_flip()
```

<img src="lec-08_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
]

---

## Box plots: custom colors

.tiny[

```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg, fill = lgID)) +
  geom_boxplot(color = "grey", alpha = .7) +
* scale_fill_manual(values = c("#E81828", "#002D72")) +
  coord_flip() +
* theme_bw()
```

<img src="lec-08_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) +
  geom_bar(stat = "identity")
```

<img src="lec-08_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots: angled text

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) +
  geom_bar(stat = "identity") +
* theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec-08_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots: sorted

.tiny[

```r
*ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = reorder(franchID, -W))) +
  geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .7) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec-08_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots: granular scale

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = reorder(franchID, -W))) +
  geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .7) +
* scale_y_continuous(breaks = seq(0, 120, 15), labels = seq(0, 120, 15), limits = c(0, 120)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec-08_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
]

---

## Histograms

.tiny[

```r
ggplot(teams, mapping = aes(x = WinPct)) +
  geom_histogram(binwidth = .025, fill = "#E81828", color = "#002D72", alpha = .7)
```

<img src="lec-08_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />
]

---

## Density plots

.tiny[

```r
ggplot(teams, mapping = aes(x = WinPct)) +
  geom_density(fill = "#E81828", color = "#002D72", alpha = .7)
```

<img src="lec-08_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />
]

---

## Density plots: custom colors

.tiny[

```r
ggplot(teams, mapping = aes(x = WinPct, fill = lgID)) +
  geom_density(alpha = .5) +
* scale_fill_manual(values = c("#E81828", "#002D72"))
```

<img src="lec-08_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
* geom_raster()
```

<img src="lec-08_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps: color palette

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
  geom_raster() +
* scale_fill_gradientn(colours = terrain.colors(10))
```

<img src="lec-08_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps: color palette

.tiny[

```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
  geom_raster() +
* scale_fill_gradient(low = "#fef0d9", high = "#b30000")
```

<img src="lec-08_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />
]

---

## Some `ggplot2` resources

- Refer to the
`ggplot2` cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf

- Pick a color scheme with [color brewer 2](http://colorbrewer2.org/)

---

## Effective visualization tips

- Use chunk options `fig.width`, `fig.height`, `fig.align`, and `fig.show` to manipulate your plot's size and placement.

- Strive to have your visualization function in a closed environment.

- Be mindful of color and scale choices.

- Generally, color is better than shape to make things pop.

- Not everything has to have a color, shape, transparency, etc.

- Add labels and annotation.

- Use your visualization to support your story.

---

class: inverse, center, middle

# Exercise

---

## Energy data

.tiny[

```r
energy <- read_csv("http://www2.stat.duke.edu/~sms185/data/energy/energy.csv")
```

]

.tiny[

```r
energy
```

```
#> # A tibble: 105 x 6
#>    MWhperDay name           type  location   note                       boe
#>        <dbl> <chr>          <chr> <chr>      <chr>                    <dbl>
#>  1         3 Chernobyl Sol… Solar Ukraine    On the site of the form…     0
#>  2       637 Solarpark Meu… Solar Germany    <NA>                        55
#>  3       920 Tesla's propo… Solar South Aus… 50,000 homes with solar…    79
#>  4      1280 Quaid-e-Azam   Solar Pakistan   Named in honor of Quaid…   110
#>  5      1760 Topaz          Solar USA        <NA>                       152
#>  6      2025 Agua Caliente  Solar USA        Arizona                    175
#>  7      2466 Kamuthi        Solar India      "\"150,000\" homes"        213
#>  8      2720 Longyangxia    Solar China      <NA>                       234
#>  9      3840 Kurnool        Solar India      <NA>                       331
#> 10      4950 Tengger Desert Solar China      "Covers 3.2% of the lan…   427
#> # … with 95 more rows
```

]

---

## Data dictionary

Each row is a power source, with the amount of energy it generates per day measured in MWh.

- `MWhperDay`: MWh of energy generated per day
- `name`: energy source name
- `type`: type of energy source
- `location`: country of energy source
- `note`: more details on energy source
- `boe`: barrel of oil equivalent

<br>

- **Daily megawatt hour (MWh)** is a measure of energy output.
- **1 MWh** is, on average, enough power for 28 people in the USA.

---

## Objective

Re-create the plot on the following slide. A few notes:

- base font size is 18
- hex colors: `c("#9d8b7e", "#315a70", "#66344c", "#678b93", "#b5cfe1", "#ffcccc")`
- use function `order()` to help get the top 30

Starter code:

```r
# Keep the 30 rows with the largest daily output
energy_top_30 <- energy[order(energy$MWhperDay, decreasing = TRUE)[1:30], ]
```

---

<img src="lec-08_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" />

???

.tiny[

```r
ggplot(energy_top_30, mapping = aes(x = reorder(name, MWhperDay),
                                    y = MWhperDay / 1000, fill = type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#9d8b7e", "#315a70", "#66344c",
                               "#678b93", "#b5cfe1", "#ffcccc")) +
  theme_bw(base_size = 18) +
  labs(y = "Daily MWh (in thousands)", x = "Power Source",
       title = "Top 30 power source energy generators", fill = "Power Source",
       caption = "1 MWh is, on average, enough power for 28 people in the USA") +
  coord_flip()
```

]

---

## References

- Grolemund, G., & Wickham, H. (2019). R for Data Science. https://r4ds.had.co.nz/data-visualisation.html

- `ggplot2` reference. https://ggplot2.tidyverse.org/reference/

- Lahman, S. (2019). Lahman's Baseball Database, 1871-2018. http://www.seanlahman.com/baseball-archive/statistics/

- Visual Capitalist. https://www.visualcapitalist.com/worlds-largest-energy-sources/