September 5, 2017

Getting started

Follow up from last time…

  • Any questions on material from last time?

  • Any questions on homework?

  • Any questions on workflow / course structure?

  • Catch up on informal "requirements":
    • Did everyone download the Slack app?
    • Did everyone adjust their Slack username to be/include their (preferred) first name?
    • Did everyone add their photo to the GitHub and Slack profiles?

R in a nutshell

Functions in R are often verbs, and the arguments for a function go in the parentheses that follow its name.

verb(what-you-want-to-apply-verb-to, other-arguments)

For example:

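library(dplyr)          # Provides glimpse(), filter(), and %>%
library(gapminder)      # Provides the gapminder dataset
glimpse(gapminder)      # Glimpse into the gapminder dataset
gapminder %>%           # Pipe into the next function
  filter(year == 1952)  # Keep only the rows where year is equal to 1952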
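
To make the verb(what-you-want-to-apply-verb-to, other-arguments) pattern concrete, here is a minimal sketch comparing the nested form with the piped form. It assumes the dplyr package is installed and uses its starwars dataset; the object names droids_nested and droids_piped are made up for illustration.

```r
library(dplyr)  # provides filter(), %>%, and the starwars dataset

# Nested form: verb(what-you-want-to-apply-verb-to, other-arguments)
droids_nested <- filter(starwars, species == "Droid")

# Piped form: the left-hand side becomes the first argument of filter()
droids_piped <- starwars %>%
  filter(species == "Droid")

# Check whether the two forms give the same result
identical(droids_nested, droids_piped)
```

The piped form tends to read more naturally once several verbs are chained one after another.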

Data visualization

"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey

  • Data visualization is the creation and study of the visual representation of data.

  • There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (ggplot2 is one of them, and that's the one we're going to use).

ggplot2

  • To use ggplot2 functions, first load the package:
library(ggplot2)
  • In ggplot2, the structure of the code for plots can often be summarized as
ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
  • Geoms, short for geometric objects, describe the type of plot you will produce
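
As one possible way to fill in the template above, the sketch below draws a histogram. It assumes the ggplot2 and dplyr packages are installed, borrows the starwars dataset that ships with dplyr, and uses labs() as a stand-in for "other options". The bin width, the labels, and the object name p are arbitrary choices for illustration.

```r
library(ggplot2)
library(dplyr)  # only needed here for the starwars dataset

# ggplot(data = [dataset], aes(x = [x-variable])) + geom_xxx() + other options
p <- ggplot(data = starwars, aes(x = height)) +
  geom_histogram(binwidth = 10) +           # the geom decides the type of plot
  labs(title = "Heights of Star Wars characters",
       x = "Height", y = "Number of characters")

p  # printing the object draws the plot
```

A histogram needs only an x variable, so the y aesthetic from the template is left out here. Swapping geom_histogram() for another geom changes the type of plot without changing the rest of the code.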

About ggplot2

  • ggplot2 is the name of the package
  • The gg in "ggplot2" stands for Grammar of Graphics
  • Inspired by the book The Grammar of Graphics by Leland Wilkinson
  • ggplot() is the main function in ggplot2

Visualizing Star Wars

Star Wars data

The dplyr package contains a dataset called starwars:

library(dplyr)
starwars
## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Dataset terminology

What does each row represent? What does each column represent?

starwars

Luke Skywalker

[Image: Luke Skywalker]

What's in the Star Wars data?

Take a glimpse at the data:

glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

What's in the Star Wars data?

Run the following in the Console to view the help file for the dataset:

?starwars

How many rows and columns does this dataset have? What does each row represent? What does each column represent?
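
One way to check your answer about rows and columns is sketched below. It assumes dplyr is loaded so that starwars is available. The counts may differ from the 87 x 13 shown in the printed output if your installed version of dplyr ships a different version of the dataset.

```r
library(dplyr)

nrow(starwars)   # number of rows (observations)
ncol(starwars)   # number of columns (variables)
dim(starwars)    # both at once: rows first, then columns
names(starwars)  # the column (variable) names
```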

Make a prediction: What relationship do you expect to see between height and mass?

Scatterplots

Mass vs. height (geom_point())

ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).

  • Not all characters have both height and mass recorded, so 28 of them are not plotted
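
To see where the number in the warning comes from, here is a small sketch that counts the characters missing height, mass, or both. It assumes dplyr is loaded; n_missing is a made-up name. The result should correspond to the count in the warning, though it can differ if your version of the dataset differs from the one used in these slides.

```r
library(dplyr)

# Keep only the rows where height or mass is missing, then count them
n_missing <- starwars %>%
  filter(is.na(height) | is.na(mass)) %>%
  nrow()

n_missing
```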

Mass vs. height

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

Mass vs. height

Who is the not-so-tall but really chubby character?
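
One possible way to track down that character is to sort by mass and look at the top few rows. This is a sketch only, assuming dplyr is loaded; heaviest is a made-up name, and showing five rows is an arbitrary choice.

```r
library(dplyr)

heaviest <- starwars %>%
  filter(!is.na(height), !is.na(mass)) %>%  # keep characters that appear in the plot
  arrange(desc(mass)) %>%                   # heaviest first
  select(name, height, mass, species) %>%   # keep only the columns of interest
  head(5)

heaviest
```

Including species in the output is a first step toward the earlier question about which other variables might explain points that do not follow the overall trend.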