September 5, 2017

Getting started

Follow up from last time…

  • Any questions on material from last time?

  • Any questions on homework?

  • Any questions on workflow / course structure?

  • Catch up on informal "requirements":
    • Did everyone download the Slack app?
    • Did everyone adjust their Slack username to be/include their (preferred) first name?
    • Did everyone add their photo to the GitHub and Slack profiles?

R in a nutshell

Functions in R are often verbs, and the arguments for those functions go in parentheses.

verb(what-you-want-to-apply-verb-to, other-arguments)

For example:

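library(dplyr)          # Provides glimpse(), filter(), and the %>% pipe
library(gapminder)      # Provides the gapminder dataset
glimpse(gapminder)      # Glimpse into the gapminder dataset
gapminder %>%           # Pipe into the next function
  filter(year == 1952)  # Keep only the rows where year is equal to 1952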
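
As a small sketch of the verb(data, other-arguments) pattern, the plain call and the piped call below should return the same result. It uses the starwars dataset that ships with dplyr so that only one package is needed; the dataset and the filter condition are illustrative choices, not part of the original example.

```r
library(dplyr)

# Plain call: the data frame is the first argument of the verb
droids_plain <- filter(starwars, species == "Droid")

# Piped call: %>% passes the left-hand side in as the first argument
droids_piped <- starwars %>%
  filter(species == "Droid")

# Check that both forms return the same rows
identical(droids_plain, droids_piped)
```

Reading the pipe as "and then" often helps: take starwars, and then filter it.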

Data visualization

Data visualization

"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey

  • Data visualization is the creation and study of the visual representation of data.

  • There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (ggplot2 is one of them, and that's the one we're going to use).

ggplot2

  • To use ggplot2 functions, first load the package
library(ggplot2)
  • In ggplot2 the structure of the code for plots can often be summarized as
ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
  • Geoms, short for geometric objects, describe the type of plot you will produce
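
To make the template concrete, here is a minimal sketch that fills in each placeholder. It assumes the mtcars dataset built into R (not used elsewhere in these slides), and the axis labels stand in for "other options".

```r
library(ggplot2)

# data = [dataset], aes(x = [x-variable], y = [y-variable])
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                          # geom_xxx(): a scatterplot
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")   # other options

p  # printing the object draws the plot
```

Swapping geom_point() for another geom changes the type of plot while the data and aesthetic mappings stay the same.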

About ggplot2

  • ggplot2 is the name of the package
  • The gg in "ggplot2" stands for Grammar of Graphics
  • Inspired by the book The Grammar of Graphics by Leland Wilkinson
  • ggplot() is the main function in ggplot2

Visualizing Star Wars

Star Wars data

The dplyr package contains a dataset called starwars:

library(dplyr)
starwars
## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Dataset terminology

What does each row represent? What does each column represent?

starwars
## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Luke Skywalker

[Image: Luke Skywalker]

What's in the Star Wars data?

Take a glimpse at the data:

glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

What's in the Star Wars data?

Run the following in the Console to view the help page for the dataset:

?starwars

How many rows and columns does this dataset have? What does each row represent? What does each column represent?
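
One way to check the row and column counts is sketched below. Note that the installed version of dplyr may ship a slightly different set of columns than the output printed on these slides, so the numbers are not hard-coded here.

```r
library(dplyr)

dims   <- dim(starwars)    # c(number of rows, number of columns)
n_rows <- nrow(starwars)   # one row per character
n_cols <- ncol(starwars)   # one column per variable

dims
names(starwars)            # the variable names
```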

Make a prediction: What relationship do you expect to see between height and mass?

Scatterplots

Mass vs. height (geom_point())

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

  • Not all characters have height and mass information, which is why 28 of them are not plotted
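
The count in the warning can be checked directly. This is a sketch that assumes a point is dropped whenever either height or mass is missing; the variable names are illustrative.

```r
library(dplyr)

# Characters that cannot be placed on the scatterplot
missing_either <- starwars %>%
  filter(is.na(height) | is.na(mass))

n_missing <- nrow(missing_either)
n_missing   # compare with the number reported in the warning
```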

Mass vs. height

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

Mass vs. height

Which character is not very tall but has a very large mass?
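
One possible way to identify the unusual point is to sort the characters by mass and look at the top few rows. This is a sketch; keeping three rows is an arbitrary choice.

```r
library(dplyr)

heaviest <- starwars %>%
  filter(!is.na(mass)) %>%        # drop characters with unknown mass
  arrange(desc(mass)) %>%         # largest mass first
  select(name, height, mass) %>%  # keep only the relevant columns
  head(3)

heaviest
```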

Additional variables

We can display additional variables with

  • aesthetics (like shape, colour, size), or

  • faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

  • color

  • size

  • shape

  • alpha (transparency)

Mass vs. height + gender

ggplot(data = starwars, aes(x = height, y = mass, color = gender)) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

Aesthetics summary

  • Continuous variables are measured on a continuous scale
  • Discrete variables are measured (or often counted) on a discrete scale

aesthetics   discrete                   continuous
color        rainbow of colors          gradient
size         discrete steps             linear mapping between radius and value
shape        different shape for each   shouldn't (and doesn't) work
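
The color row of the summary can be seen by mapping color first to a discrete variable and then to a continuous one. This is a sketch; using birth_year as the continuous variable is an illustrative choice.

```r
library(dplyr)
library(ggplot2)

# Discrete variable: each category gets its own color
p_discrete <- ggplot(data = starwars, aes(x = height, y = mass, color = gender)) +
  geom_point()

# Continuous variable: values are shown along a color gradient
p_continuous <- ggplot(data = starwars, aes(x = height, y = mass, color = birth_year)) +
  geom_point()

p_discrete
p_continuous
```

Compare the two legends: the first lists categories, the second shows a color bar.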

Faceting

Faceting options

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data

Mass vs. height by gender

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(gender ~ .) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_grid(. ~ gender) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_wrap(~ eye_color) +
  geom_point()  
## Warning: Removed 28 rows containing missing values (geom_point).

Facet summary

  • facet_grid(): 2d grid, rows ~ cols, . for no split

  • facet_wrap(): 1d ribbon wrapped into 2d
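
The two functions can be compared side by side in the sketch below. Restricting the data to humans and droids is an assumption made only to keep the grid small, and ncol = 5 is an arbitrary layout choice.

```r
library(dplyr)
library(ggplot2)

# Keep two species so the grid stays readable
humans_droids <- starwars %>%
  filter(species %in% c("Human", "Droid"))

# facet_grid(): species defines the rows, gender defines the columns
p_grid <- ggplot(data = humans_droids, aes(x = height, y = mass)) +
  facet_grid(species ~ gender) +
  geom_point()

# facet_wrap(): one panel per eye color, wrapped into rows of 5 panels
p_wrap <- ggplot(data = starwars, aes(x = height, y = mass)) +
  facet_wrap(~ eye_color, ncol = 5) +
  geom_point()
```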

Other geoms

Height vs. mass, take 2

How are these plots similar? How are they different?

geom_smooth

To plot a smooth curve, use geom_smooth()

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_smooth()
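
Geoms can also be layered. The sketch below draws the smooth curve on top of the points so the trend and the raw data are visible together; combining these two geoms is an illustration, not something the slides prescribe.

```r
library(dplyr)
library(ggplot2)

p_layers <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point() +    # first layer: one point per character
  geom_smooth()     # second layer: smooth curve through the same data

p_layers
```

Both layers inherit the data and the aesthetic mappings given in ggplot(), and each one warns about the rows with missing values.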

Exploratory data analysis (EDA)

Number of variables involved

  • Univariate data analysis - distribution of single variable

  • Bivariate data analysis - relationship between two variables

  • Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

Types of variables

  • Numerical variables can be classified as continuous or discrete based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.

  • If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
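
In R, a categorical variable is usually stored as a factor, and an ordinal one as an ordered factor. The sketch below uses a made-up vector of sizes to show the difference; the values and level order are hypothetical.

```r
sizes <- c("small", "large", "medium", "small")

# Categorical with no natural ordering
size_nominal <- factor(sizes)

# Ordinal: the levels are given in their natural order
size_ordinal <- factor(sizes,
                       levels = c("small", "medium", "large"),
                       ordered = TRUE)

is.ordered(size_nominal)   # FALSE
is.ordered(size_ordinal)   # TRUE

# Comparisons are only meaningful for the ordered factor
size_ordinal[1] < size_ordinal[2]   # TRUE, because small comes before large
```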

Describing shapes of numerical distributions

  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean), median (median), mode (not always useful)
  • spread: range (range), standard deviation (sd), inter-quartile range (IQR)
  • unusual observations
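
The functions named in parentheses above can be applied to a single variable with summarise(). This sketch uses height from the starwars data; na.rm = TRUE is needed because some heights are missing, and the column names are illustrative.

```r
library(dplyr)

height_summary <- starwars %>%
  summarise(
    mean_height   = mean(height, na.rm = TRUE),    # center
    median_height = median(height, na.rm = TRUE),  # center
    sd_height     = sd(height, na.rm = TRUE),      # spread
    iqr_height    = IQR(height, na.rm = TRUE),     # spread
    min_height    = min(height, na.rm = TRUE),     # range, lower end
    max_height    = max(height, na.rm = TRUE)      # range, upper end
  )

height_summary
```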

Histograms

For numerical variables

ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 10)
## Warning: Removed 6 rows containing non-finite values (stat_bin).
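
The choice of binwidth affects how the shape of the distribution appears. As a sketch, the two plots below use a narrower and a wider bin than the one above; the values 5 and 25 are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Narrow bins: more detail, but a noisier shape
p_narrow <- ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 5)

# Wide bins: smoother shape, but less detail
p_wide <- ggplot(starwars, aes(x = height)) +
  geom_histogram(binwidth = 25)
```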

Bar plots

For categorical variables

ggplot(starwars, aes(x = gender)) +
  geom_bar()
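
The bar heights are the number of rows in each category. A minimal sketch of the same tally as a table, using count() from dplyr:

```r
library(dplyr)

gender_counts <- starwars %>%
  count(gender)   # one row per category, with the tally in column n

gender_counts
```

Missing values are counted as their own category, so the counts add up to the number of rows in the dataset.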