From last time…

A note on piping and layering

  • The %>% operator used with dplyr functions is called the pipe operator. It “pipes” the output of the previous line of code into the next line of code as its first argument.

  • The + operator in ggplot2 functions is used for “layering”. This means you create the plot in layers, separated by +.
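The two operators can appear in the same chunk of code. The sketch below is a minimal illustration, assuming the dplyr and ggplot2 packages are installed; the choice of `filter(year == 2008)` is only an example and is not part of the original slides.

```r
library(dplyr)
library(ggplot2)

# Pipe: the output of each step becomes the first argument of the next step.
# Layer: once ggplot() has been called, plot components are added with +.
p <- mpg %>%
  filter(year == 2008) %>%                 # keep only the 2008 models
  ggplot(aes(x = displ, y = hwy)) +        # the filtered data is the `data` argument
  geom_point()                             # add a layer of points

print(p)
```

Note the switch from %>% to + at the `ggplot()` call: data steps are piped, plot layers are added.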

Today’s agenda

  • Highlights from the “Tidy data” paper

  • Visualizing data with ggplot2
    • Goal: Learn syntax and various visualizations for different data types

Tidy data

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
Which of the following is a data set and which is a summary table?
## Source: local data frame [5 x 2]
## 
##   continent mean_lifeexp
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950

## Source: local data frame [142 x 3]
## 
##                     country continent lifeExp
## 1                   Algeria    Africa  72.301
## 2                    Angola    Africa  42.731
## 3                     Benin    Africa  56.728
## 4                  Botswana    Africa  50.728
## 5              Burkina Faso    Africa  52.295
## 6                   Burundi    Africa  49.580
## 7                  Cameroon    Africa  50.430
## 8  Central African Republic    Africa  44.741
## 9                      Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## ..                      ...       ...     ...
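One way to see how the two printouts relate is to compute a per-continent mean from rows of raw observations. The sketch below is only an illustration: the data frame `life` is a hand-typed toy copy of the first three rows printed above, and the names `life` and `life_summary` are hypothetical. It assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data: the first three rows of the longer printout above (one row per country)
life <- data.frame(
  country   = c("Algeria", "Angola", "Benin"),
  continent = c("Africa", "Africa", "Africa"),
  lifeExp   = c(72.301, 42.731, 56.728)
)

# Collapse the observations to one row per continent
life_summary <- life %>%
  group_by(continent) %>%
  summarise(mean_lifeexp = mean(lifeExp))

print(life_summary)
```

In the toy version each row of `life` is one observation, while each row of `life_summary` is a value computed from several observations.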

Data visualization

ggplot2

  • To use ggplot2 functions, first load the package
library(ggplot2)
  • In ggplot2 the structure of the code for plots can often be summarized as
ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
  • Geoms, short for geometric objects, describe the type of plot you will produce
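As a sketch of how the template might be filled in, the example below uses the mpg dataset that ships with ggplot2. Using `labs()` as the "other options" step is an assumption made for illustration; many other functions could go in that position.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +   # dataset and x/y variables
  geom_point() +                                     # geom: draw one point per row
  labs(                                              # "other options": axis labels and title
    title = "Engine size and highway mileage",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

print(p)
```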

Context: Fuel economy

Data: mpg, fuel economy data from 1999 and 2008 for 38 popular models of cars

  • Run the following in the Console:
# see the help file
?mpg
# view data
View(mpg)
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
Make a prediction: What relationship do you expect to see between engine size (displ) and mileage (hwy)?
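`View()` opens an interactive viewer, so it is mainly useful in RStudio. As a minimal sketch, the base R functions below give the same kind of overview in the Console; they are suggestions and not part of the original slides.

```r
library(ggplot2)   # the mpg dataset ships with ggplot2

dim(mpg)     # number of rows, then number of columns
names(mpg)   # column (variable) names
head(mpg)    # first six rows
str(mpg)     # type of each column and a preview of its values
```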

Scatterplots

Displacement vs. highway mpg (geom_point())

How would you describe this relationship? What other variables would help us understand data points that don’t follow the overall trend?
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

Additional variables

Additional variables can be displayed with

  • aesthetics (like shape, colour, size), or

  • faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

  • color

  • size

  • shape

  • alpha (transparency)

Displacement vs. highway mpg + class

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
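The other aesthetics listed earlier are mapped in the same way, inside `aes()`. The sketch below is an illustration only; the choice of `drv` and `cty` as the mapped variables is an assumption, not something taken from the slides.

```r
library(ggplot2)

# shape: drive type (drv) is discrete and has only three levels
p_shape <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()

# size: city mileage (cty) is numeric, so larger values give larger points
p_size <- ggplot(data = mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

# alpha: the same numeric variable mapped to transparency instead
p_alpha <- ggplot(data = mpg, aes(x = displ, y = hwy, alpha = cty)) +
  geom_point()

print(p_shape)
print(p_size)
print(p_alpha)
```

`drv` is used for shape rather than `class` because the default shape palette handles at most six discrete values, and `class` has seven.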

Aesthetics summary

aesthetics   discrete                   continuous
color        rainbow of colors          gradient
size         discrete steps             linear mapping between area and value
shape        different shape for each   shouldn’t (and doesn’t) work
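To make the discrete and continuous cases concrete, the sketch below maps color first to a categorical variable and then to a numeric one. Using `cty` as the continuous variable is an assumption made for illustration.

```r
library(ggplot2)

# Discrete variable: each class gets its own distinct color
p_discrete <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Continuous variable: city mileage is shown on a color gradient
p_continuous <- ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

print(p_discrete)
print(p_continuous)
```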

Faceting

Faceting options

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data

Displacement vs. highway mpg by cylinders

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl) +
  geom_point()
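When the faceting variable has many levels, a single row of panels can become cramped. The sketch below uses `facet_wrap()`, a ggplot2 function that is not shown in the slides above, as one possible alternative; the choice of `class` as the faceting variable is an assumption.

```r
library(ggplot2)

# facet_wrap() lays the panels out over several rows and columns
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class) +    # one panel per vehicle class
  geom_point()

print(p)
```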

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) +
  geom_point()