class: center, middle, inverse, title-slide

# Visualizing Data
### Yue Jiang
### STA 210 / Duke University / Spring 2023

---

## EDA and data visualization

- .vocab[Exploratory data analysis (EDA)] is an approach to analyzing data sets to summarize their main characteristics.

- Often, EDA is visual. That's what we're focusing on today.

> *"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey*

<br>

- .vocab[Data visualization] is the creation and study of the visual representation of data.

- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations.

- We'll be using **`ggplot2`**.

---

## A Grammar of Graphics

A tool that allows for concisely describing the components of a graphic:

<img src="img/gglayers.png" width="70%" style="display: block; margin: auto;" />

---

## Hello ggplot2!

- `ggplot()` is the main function in ggplot2, and plots are constructed in layers

- The structure of the code for plots can often be summarized as

```r
ggplot +
  geom_xxx +
  ...
```

or, more precisely

.small[
```r
ggplot(data = [dataset], mapping = aes(x = ..., [even more?])) +
  geom_xxx() +
  other_options + ...
```
]

To use ggplot2 functions, first load tidyverse

```r
library(tidyverse)
```

For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)

---

## Tidy data

.question[
What does each row represent? What does each column represent?
]

.small[
```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
```
]

.center[
Each row is an .vocab[observation]; each column is a .vocab[variable]
]

---

## Luke Skywalker

<img src="img/luke-skywalker.png" width="90%" style="display: block; margin: auto;" />

---

## What's in the Star Wars data?
Take a `glimpse` of the data:

```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
```

---

## A basic plot...

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="data-viz_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## What is the dataset being plotted?

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="data-viz_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## What is the aesthetic mapping?

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="data-viz_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## What function is doing the plotting?

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="data-viz_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## What's that warning?

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="data-viz_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## What's that warning?

- Not all characters have height and mass information, so 28 of them are not plotted

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

- We can suppress warnings to save space in the output documents, but it's important to note them

- To suppress warnings, set the chunk option:

.center[
`{r code-chunk-label, warning=FALSE}`
]

---

## Adding a layer for labels

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Mass (kg)")
```

<img src="data-viz_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Mass vs. height

.question[
How would you describe this **relationship**?

What other variables would help us understand data points that don't follow the overall trend?

Who is the not so tall but really heavy character?
]

.small[
<img src="data-viz_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Additional variables

We can map additional variables to various visual characteristics of the plot:

- **aesthetics**
    - shape
    - color
    - size
    - alpha (transparency)
- **faceting**: small multiples displaying different subsets

---

## Mass vs. height + hair

```r
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = hair_color)) +
  geom_point()
```

<img src="data-viz_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Mass vs. height + hair

```r
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = hair_color)) +
  geom_point(size = 3)
```

<img src="data-viz_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Mass vs. height + hair

```r
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = hair_color,
                     shape = hair_color)) +
  geom_point(size = 3)
```

<img src="data-viz_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Mass vs. height + hair + birth year

```r
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = hair_color,
                     shape = hair_color, size = birth_year)) +
  geom_point()
```

<img src="data-viz_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Aesthetics summary

- Continuous variables are measured on a continuous scale
- Discrete variables are measured (or often counted) on a discrete scale

.small[
aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | linear mapping between radius and value
shape         | different shape for each | shouldn't (and doesn't) work
]

<br>

.alert[Use aesthetics (`aes`) for mapping features of a plot to a variable; define the features in the `geom_xxx` for customization **<u>not</u>** mapped to a variable
]

---

## Additional plot options

```r
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color = hair_color,
                     shape = hair_color, size = birth_year)) +
  geom_point() +
  labs(title = "Positive relationship between height and weight",
       subtitle = "Jabba is a conspicuous outlier",
       x = "Height (cm.)", y = "Mass (kg.)",
       color = "Hair color", shape = "Hair color",
       size = "Birth year (BBY)")
```

---

## Additional plot options

<img src="data-viz_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

## Faceting options

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

.small[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ sex) +
  labs(title = "Mass vs. height of Starwars characters",
       subtitle = "Faceted by sex",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="data-viz_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]

---

## Dive further...

.question[
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
]

- `facet_grid()`: 2d grid, `rows ~ cols`
- `facet_wrap()`: 1d ribbon wrapped into 2d

The plots in the next few slides do not have proper titles, axis labels, etc., so you can more easily focus on what's happening in the plots. But you should always label your plots!

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(hair_color ~ .)
```

<img src="data-viz_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ hair_color)
```

<img src="data-viz_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_wrap(~ eye_color)
```

<img src="data-viz_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Number of variables involved

- .vocab[Univariate data analysis]: distribution of a single variable

- .vocab[Bivariate data analysis]: relationship between two variables

- .vocab[Multivariate data analysis]: relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

---

## Types of variables

- .vocab[Numerical variables] can be classified as .vocab[continuous] or .vocab[discrete], depending on whether the variable can take on an infinite number of values within a range (continuous) or only non-negative whole numbers (discrete).
    - *height* is continuous
    - *number of siblings* is discrete

--

- If the variable is .vocab[categorical], we can determine if it is .vocab[ordinal] based on whether or not the levels have a natural ordering.
    - *hair color* is unordered
    - *highest educational attainment* is ordinal

---

## Describing numerical distributions

- **shape:**
    - skewness: right-skewed, left-skewed, symmetric
    - modality: unimodal, bimodal, multimodal, uniform
- **center:** mean, median, mode (not always useful)
- **spread:** range, standard deviation, inter-quartile range
- **outliers**: observations outside of the usual pattern

---

## Histograms

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

```
## Warning: Removed 6 rows containing non-finite values (stat_bin).
```

<img src="data-viz_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Density plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

```
## Warning: Removed 6 rows containing non-finite values (stat_density).
```

<img src="data-viz_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## Side-by-side box plots

.small[
```r
ggplot(data = starwars, mapping = aes(y = height, x = hair_color)) +
  geom_boxplot()
```

```
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
```

<img src="data-viz_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = hair_color)) +
  geom_bar()
```

<img src="data-viz_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, counts

.small[
```r
# Note: eye_color2 is not one of the 14 starwars columns shown by glimpse();
# it is presumably created in a chunk that is not displayed on these slides
ggplot(data = starwars, mapping = aes(x = hair_color, fill = eye_color2)) +
  geom_bar()
```

<img src="data-viz_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, proportions

.small[
```r
ggplot(data = starwars, mapping = aes(x = hair_color, fill = eye_color2)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
```

<img src="data-viz_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

# Why do we visualize?
---

## Data: `datasaurus_dozen`

Below is an excerpt from the `datasaurus_dozen` dataset:

```
## # A tibble: 142 x 8
##    away_x away_y bullseye_x bullseye_y circle_x circle_y dino_x dino_y
##     <dbl>  <dbl>      <dbl>      <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
##  1   32.3   61.4       51.2       83.3     56.0     79.3   55.4   97.2
##  2   53.4   26.2       59.0       85.5     50.0     79.0   51.5   96.0
##  3   63.9   30.8       51.9       85.8     51.3     82.4   46.2   94.5
##  4   70.3   82.5       48.2       85.0     51.2     79.2   42.8   91.4
##  5   34.1   45.7       41.7       84.0     44.4     78.2   40.8   88.3
##  6   67.7   37.1       37.9       82.6     45.0     77.9   38.7   84.9
##  7   53.3   97.5       39.5       80.8     48.6     78.8   35.6   79.9
##  8   63.5   25.1       39.6       82.7     42.1     76.9   33.1   77.6
##  9   68.0   81.0       34.8       80.0     41.0     76.4   29.0   74.5
## 10   67.4   29.7       27.6       72.8     34.6     72.7   26.2   71.4
## # ... with 132 more rows
```

---

## Summary statistics

```r
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(r = cor(x, y))
```

```
## # A tibble: 13 x 2
##    dataset          r
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
```

---

## Visualize the relationships!

.question[
How similar do the relationships between `x` and `y` look based on the plots? Based on the summary statistics?
]

<img src="data-viz_files/figure-html/datasaurus-plot-1.png" style="display: block; margin: auto;" />

---

## Anscombe's quartet

<img src="img/anscombe.png" width="70%" style="display: block; margin: auto;" />

---

## A website you'll probably return to

[Top 50 ggplot2 visualizations in R](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)

---

## Let's practice!

<img src="img/spice.jpg" width="65%" style="display: block; margin: auto;" />

<p style="text-align: center;">
<a href = "https://classroom.github.com/a/UW2Fj8Ed">https://classroom.github.com/a/UW2Fj8Ed</a>
</p>
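---

## Sketch: recreating the datasaurus plot

The "Visualize the relationships!" slide shows the faceted datasaurus plot without its code. Below is a minimal sketch of how such a plot could be produced. It assumes the `datasauRus` package is installed and uses the long-format `datasaurus_dozen` data (columns `dataset`, `x`, `y`). The object names `dino_stats` and `dino_plot`, the `ncol = 4` layout, and the hidden legend are illustrative choices, not necessarily what generated the original figure.

```r
library(tidyverse)
library(datasauRus)

# Summary statistics per dataset: these come out nearly identical
# across all 13 datasets
dino_stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y), r = cor(x, y))

# One scatterplot panel per dataset: the shapes differ dramatically
# even though the summaries above do not
dino_plot <- ggplot(data = datasaurus_dozen,
                    mapping = aes(x = x, y = y, color = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 4) +
  theme(legend.position = "none")  # legend is redundant with the facet labels

dino_plot
```

Printing `dino_stats` next to `dino_plot` makes the point of this section concrete: summary statistics alone can hide very different relationships.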