03 - Fundamentals of data & data visualization

September 5, 2017

Getting started

Follow up from last time…

Any questions on material from last time?
Any questions on homework?
Any questions on workflow / course structure?
Catch up on informal "requirements":
- Did everyone download the Slack app?
- Did everyone adjust their Slack username to be/include their (preferred) first name?
- Did everyone add their photo to the GitHub and Slack profiles?

R in a nutshell

Functions in R are often verbs, and their arguments go in parentheses after the function name.

verb(what-you-want-to-apply-verb-to, other-arguments)

For example:

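library(dplyr)          # Provides glimpse(), filter(), and the %>% pipe
library(gapminder)      # Provides the gapminder dataset

glimpse(gapminder)      # Glimpse into the gapminder dataset

gapminder %>%           # Pipe into the next function
  filter(year == 1952)  # Filter if year is equal to 1952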
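
As a minimal sketch of the same pattern: the pipe passes the object on its left as the first argument of the function on its right, so the nested call and the piped call here should keep the same rows. This assumes the dplyr package is installed and uses its starwars dataset (introduced later in these slides); the 200 cm cutoff is an arbitrary choice for illustration.

```r
library(dplyr)

# Nested call: verb(what-you-want-to-apply-verb-to, other-arguments)
tall_nested <- filter(starwars, height > 200)

# Piped call: the data comes first, then flows into the verb
tall_piped <- starwars %>%
  filter(height > 200)

# Check that both approaches keep exactly the same characters
identical(tall_nested$name, tall_piped$name)
```

Pipes tend to pay off once several verbs are chained, because the code then reads top to bottom in the order the steps happen.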

Data visualization

"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey

Data visualization is the creation and study of the visual representation of data.
There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (ggplot2 is one of them, and that's the one we're going to use).

ggplot2

To use ggplot2 functions, first load the package

library(ggplot2)

In ggplot2 the structure of the code for plots can often be summarized as

ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options

Geoms, short for geometric objects, describe the type of plot you will produce
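
One way to read the template is as a plot built up in steps. The sketch below fills it in with the starwars dataset from dplyr (assumed to be installed); the object names base_plot and scatter_plot are made up for this example.

```r
library(ggplot2)
library(dplyr)   # only needed here for the starwars dataset

# Step 1: ggplot() sets up the data and the aesthetic mappings,
# but does not draw any observations yet
base_plot <- ggplot(data = starwars, aes(x = height, y = mass))

# Step 2: adding a geom tells ggplot2 how to draw the observations
scatter_plot <- base_plot + geom_point()

# Printing the object draws the plot
scatter_plot
```

Storing the base plot in an object makes it easy to try different geoms on the same data and mappings.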

About ggplot2

ggplot2 is the name of the package
The gg in "ggplot2" stands for Grammar of Graphics
Inspired by the book The Grammar of Graphics by Leland Wilkinson
ggplot() is the main function in ggplot2

Visualizing Star Wars

Star Wars data

The dplyr package contains a dataset called starwars:

library(dplyr)
starwars

## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Dataset terminology

What does each row represent? What does each column represent?

starwars

Luke Skywalker

[Image: Luke Skywalker]

What's in the Star Wars data?

Take a glimpse at the data:

glimpse(starwars)

## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

What's in the Star Wars data?

Run the following in the Console to view the help

?starwars

How many rows and columns does this dataset have? What does each row represent? What does each column represent?
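
The row and column counts can also be checked directly in the Console. This is a small sketch using base R functions; the exact counts may differ from the output shown earlier if the installed dplyr version ships an updated copy of the dataset.

```r
library(dplyr)

nrow(starwars)    # number of rows (observations)
ncol(starwars)    # number of columns (variables)
dim(starwars)     # both at once: rows first, then columns
names(starwars)   # the column (variable) names
```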

Make a prediction: What relationship do you expect to see between height and mass?

Scatterplots

Mass vs. height (`geom_point()`)

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

Not all characters have both height and mass recorded, so 28 of the 87 are dropped from the plot.
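
To see where the number in the warning comes from, the rows with missing values can be counted by hand. A minimal sketch, assuming dplyr is loaded; missing_either is a name made up for this example.

```r
library(dplyr)

# Rows where height or mass (or both) is missing
missing_either <- starwars %>%
  filter(is.na(height) | is.na(mass))

nrow(missing_either)   # compare with the number reported in the warning

# Count the missing values in each variable separately
starwars %>%
  summarise(
    missing_height = sum(is.na(height)),
    missing_mass   = sum(is.na(mass))
  )
```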

Mass vs. height

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

Mass vs. height

Who is the not-so-tall but extremely heavy character?
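
One possible way to track down the outlier is to sort the characters by mass. This is a sketch rather than the only approach; heaviest is a name made up for this example.

```r
library(dplyr)

# Sort characters from heaviest to lightest and keep the relevant columns
heaviest <- starwars %>%
  filter(!is.na(mass)) %>%
  arrange(desc(mass)) %>%
  select(name, height, mass, species)

# The outlier should appear in the first row
head(heaviest, 3)
```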

Additional variables

Additional variables can be displayed with

aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

color
size
shape
alpha (transparency)
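
A short sketch of two of these options in use. It also illustrates a distinction the list does not spell out: an aesthetic placed inside aes() is mapped to a variable, while one placed outside aes() is set to the same fixed value for every point. The object names are made up for this example.

```r
library(ggplot2)
library(dplyr)   # only needed here for the starwars dataset

# Mapped: shape varies with the data, so it goes inside aes()
shape_plot <- ggplot(data = starwars, aes(x = height, y = mass, shape = gender)) +
  geom_point()

# Set: alpha and size are fixed for every point, so they go outside aes()
fixed_plot <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point(alpha = 0.5, size = 3)

shape_plot
fixed_plot
```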

Mass vs. height + gender

ggplot(data = starwars, aes(x = height, y = mass, color = gender)) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

Aesthetics summary

Continuous variables are measured on a continuous scale
Discrete variables are measured (or often counted) on a discrete scale

aesthetics	discrete	continuous
color	rainbow of colors	gradient
size	discrete steps	continuous mapping between value and point area (the default size scale)
shape	different shape for each	shouldn't (and doesn't) work
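
The color row of the table can be seen by mapping color first to a discrete variable and then to a continuous one. A minimal sketch, assuming ggplot2 and dplyr are installed; the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)   # only needed here for the starwars dataset

# Discrete variable: each gender gets its own distinct color
color_discrete <- ggplot(data = starwars, aes(x = height, y = mass, color = gender)) +
  geom_point()

# Continuous variable: birth_year is shown as a color gradient
color_continuous <- ggplot(data = starwars, aes(x = height, y = mass, color = birth_year)) +
  geom_point()

color_discrete
color_continuous
```

Comparing the two legends is a quick way to tell how ggplot2 has interpreted a variable.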

Faceting

Faceting options

Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

Mass vs. height by gender

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(gender ~ .) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_wrap(~ eye_color) +
  geom_point()

## Warning: Removed 28 rows containing missing values (geom_point).

Facet summary

facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
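
The earlier plots only split along one dimension at a time. The sketch below shows a full rows ~ cols grid next to a wrapped ribbon. Note that height_group is a hypothetical helper variable created just for this example (the 180 cm cutoff is arbitrary); it is not part of the starwars dataset.

```r
library(ggplot2)
library(dplyr)

# height_group is a made-up grouping variable for illustration only
starwars_grouped <- starwars %>%
  filter(!is.na(height), !is.na(mass)) %>%
  mutate(height_group = if_else(height > 180, "over 180 cm", "180 cm or under"))

# facet_grid(): gender in the rows, height_group in the columns
grid_plot <- ggplot(data = starwars_grouped, aes(x = height, y = mass)) +
  facet_grid(gender ~ height_group) +
  geom_point()

# facet_wrap(): one panel per gender, wrapped into 2 columns
wrap_plot <- ggplot(data = starwars_grouped, aes(x = height, y = mass)) +
  facet_wrap(~ gender, ncol = 2) +
  geom_point()

grid_plot
wrap_plot
```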

Other geoms

Height vs. mass, take 2

How are these plots similar? How are they different?

`geom_smooth`

To plot a smooth curve, use geom_smooth()

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_smooth()
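
Geoms can also be layered, which makes it easier to compare the smooth curve with the raw points. A minimal sketch, assuming ggplot2 and dplyr are installed; the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)   # only needed here for the starwars dataset

# Draw the points first, then add the smooth curve on top
points_and_curve <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth()

# method = "lm" fits a straight line instead of the default smoother
points_and_line <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm")

points_and_curve
points_and_line
```

Both layers inherit the data and aesthetic mappings given in ggplot(), so they do not need to be repeated for each geom.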

Exploratory data analysis (EDA)

Number of variables involved

Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

Types of variables

Numerical variables can be classified as continuous or discrete: a continuous variable can take on any of an infinite number of values within a range, while a discrete variable can take on only non-negative whole numbers (counts).
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
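
A rough sketch of how these distinctions show up in R. The size_class vector is hypothetical data invented for this example; it is not part of the starwars dataset.

```r
library(dplyr)

class(starwars$height)    # a numerical variable
class(starwars$species)   # stored as character, treated as categorical

# Hypothetical ordinal variable: the levels have a natural ordering
size_class <- factor(
  c("small", "large", "medium", "small"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparisons are meaningful for ordered factors
size_class < "large"
```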

Describing shapes of numerical distributions

shape:
- skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- modality: unimodal, bimodal, multimodal, uniform
center: mean (mean), median (median), mode (not always useful)
spread: range (range), standard deviation (sd), inter-quartile range (IQR)
unusual observations
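
The functions named in parentheses above can be applied to a single variable. A minimal sketch for height, assuming dplyr is loaded; na.rm = TRUE is needed because some heights are missing, and height_summary is a name made up for this example.

```r
library(dplyr)

height_summary <- starwars %>%
  summarise(
    mean_height   = mean(height, na.rm = TRUE),     # center
    median_height = median(height, na.rm = TRUE),   # center
    sd_height     = sd(height, na.rm = TRUE),       # spread
    iqr_height    = IQR(height, na.rm = TRUE),      # spread
    min_height    = min(height, na.rm = TRUE),      # range, lower end
    max_height    = max(height, na.rm = TRUE)       # range, upper end
  )

height_summary
```

Comparing the mean with the median gives a first hint about skewness, since the mean is pulled toward the longer tail.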

Histograms

For numerical variables

ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 10)

## Warning: Removed 6 rows containing non-finite values (stat_bin).
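
The binwidth argument controls how much detail the histogram shows. The sketch below compares two alternative widths; the values 5 and 25 are arbitrary choices for illustration, and the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)   # only needed here for the starwars dataset

# Narrow bins show more detail but also more noise
hist_narrow <- ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 5)

# Wide bins give a smoother but coarser picture
hist_wide <- ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 25)

hist_narrow
hist_wide
```

Trying a few widths is worthwhile, since features such as a second mode can appear or disappear depending on the choice.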

Bar plots

For categorical variables

ggplot(starwars, aes(x = gender)) +
  geom_bar()
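
The bar heights are counts of rows in each category, and the same counts can be produced as a table. A minimal sketch using count() from dplyr; gender_counts is a name made up for this example.

```r
library(dplyr)

# One row per category, with n giving the number of characters in it
gender_counts <- starwars %>%
  count(gender)

gender_counts
```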
