From last time…

A note on piping and layering

  • The %>% operator used with dplyr functions is called the pipe operator. It “pipes” the output of the previous line of code in as the first argument of the next line of code.

  • The + operator in ggplot2 functions is used for “layering”. This means you create the plot in layers, separated by +.
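As a quick refresher, both ideas can be sketched side by side. The example below uses the built-in mtcars data, which is an assumption made purely for illustration and not a dataset from these slides; the object names are likewise hypothetical.

```r
library(dplyr)
library(ggplot2)

# Piping: each step's output becomes the first argument of the next step.
# mtcars is a built-in dataset, used here only for illustration.
mean_mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Layering: ggplot() sets up the plot, then each + adds a layer on top.
p <- ggplot(data = mean_mpg_by_cyl, aes(x = cyl, y = mean_mpg)) +
  geom_point()
p
```

Without the pipe, the first step would be written inside out as summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg)), which is why piped code is usually easier to read from top to bottom.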

Today’s agenda

  • Highlights from the “Tidy data” paper

  • Visualizing data with ggplot2
    • Goal: Learn syntax and various visualizations for different data types

Tidy data

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
Which of the following is a data set and which is a summary table?
## Source: local data frame [5 x 2]
## 
##   continent mean_lifeexp
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950





## Source: local data frame [142 x 3]
## 
##                     country continent lifeExp
## 1                   Algeria    Africa  72.301
## 2                    Angola    Africa  42.731
## 3                     Benin    Africa  56.728
## 4                  Botswana    Africa  50.728
## 5              Burkina Faso    Africa  52.295
## 6                   Burundi    Africa  49.580
## 7                  Cameroon    Africa  50.430
## 8  Central African Republic    Africa  44.741
## 9                      Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## ..                      ...       ...     ...
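The difference between the two tables can be sketched in code. The example below is a minimal illustration that assumes only the first three rows of the listing above, so its mean will not match the Africa value in the summary table, which was computed from more countries. The object names are hypothetical.

```r
library(dplyr)

# A data set: one row per observation (here, a country).
# Values are the first three rows of the listing above.
countries <- data.frame(
  country   = c("Algeria", "Angola", "Benin"),
  continent = c("Africa", "Africa", "Africa"),
  lifeExp   = c(72.301, 42.731, 56.728)
)

# A summary table: one row per group, holding a statistic computed
# from the observations in that group.
by_continent <- countries %>%
  group_by(continent) %>%
  summarise(mean_lifeexp = mean(lifeExp))

by_continent
```

The data set keeps every observation, while the summary table collapses each group to a single row, so the original observations cannot be recovered from it.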

Data visualization

ggplot2

  • To use ggplot2 functions, first load the package
library(ggplot2)
  • In ggplot2 the structure of the code for plots can often be summarized as
ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
  • Geoms, short for geometric objects, describe the type of plot you will produce
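To make the template concrete, here is one way it might be filled in. This is a sketch only: it assumes the built-in mtcars data rather than a dataset from these slides, and labs() stands in as just one possible example of "other options".

```r
library(ggplot2)

# mtcars is a built-in dataset, chosen here only to illustrate the template.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +        # data and aesthetic mappings
  geom_point() +                                          # geom: one point per row
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")   # an example of "other options"
p
```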

Context: Fuel economy

data: mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

  • Run the following in the Console:
# see the help file
?mpg
# view data
View(mpg)
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
Make a prediction: What relationship do you expect to see between engine size (displ) and mileage (hwy)?
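Besides View(), a few base R functions can help answer the questions about rows and columns. The following is a small sketch of commands you might try; it assumes only that ggplot2 is installed, since mpg ships with that package.

```r
library(ggplot2)  # the mpg data ships with ggplot2

dim(mpg)     # number of rows, then number of columns
names(mpg)   # column (variable) names
str(mpg)     # type and first few values of each column
head(mpg)    # first six rows
```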

Scatterplots

Displacement vs. highway mpg (geom_point())

How would you describe this relationship? What other variables would help us understand data points that don’t follow the overall trend?
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

Additional variables

Can display additional variables with

  • aesthetics (like shape, colour, size), or

  • faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

  • color

  • size

  • shape

  • alpha (transparency)

Displacement vs. highway mpg + class

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Aesthetics summary

aesthetics   discrete                   continuous
color        rainbow of colors          gradient
size         discrete steps             linear mapping between radius and value
shape        different shape for each   shouldn’t (and doesn’t) work
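The discrete versus continuous behavior of the color aesthetic can be seen by mapping it to two different kinds of variables. The sketch below picks drv and cty from mpg as examples; the choice of variables and the object names are assumptions for illustration.

```r
library(ggplot2)

# drv (drive type) is categorical: each level gets its own distinct color.
p_discrete <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# cty (city mpg) is numerical: values are mapped onto a color gradient.
p_continuous <- ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

p_discrete
p_continuous
```

Compare the two legends: the first lists one entry per level, while the second shows a continuous color bar.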

Faceting

Faceting options

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data

Displacement vs. highway mpg by cylinders

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl) +
  geom_point()

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class) +
  geom_point()

Facet summary

  • facet_grid(): 2d grid, rows ~ cols, . for no split

  • facet_wrap(): 1d ribbon wrapped into 2d

Other geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

geom_smooth

To plot a smooth curve, use geom_smooth()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth()
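Because plots are built in layers, geoms can also be combined. As a sketch, the code below draws the points and the smooth curve on the same axes, which may help when comparing the two views of displacement vs. highway mpg.

```r
library(ggplot2)

# Two layers sharing the same data and aesthetic mappings:
# the raw points first, then a smooth curve drawn on top.
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
p
```

With no arguments, geom_smooth() chooses a smoothing method on its own and typically prints a message naming the method it picked.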

Exploratory data analysis (EDA)

Number of variables involved

  • Univariate data analysis - distribution of single variable

  • Bivariate data analysis - relationship between two variables

  • Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others
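Each of these three levels can be sketched with the mpg data. The variable choices, the bin width, and the object names below are assumptions made for illustration, not the only reasonable options.

```r
library(ggplot2)

# Univariate: distribution of a single variable (highway mpg).
p_uni <- ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Bivariate: relationship between two variables.
p_bi <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Multivariate: the same relationship, conditioned on drive type.
p_multi <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv)

p_uni
p_bi
p_multi
```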

Types of variables

  • A numerical variable is continuous if it can take any of an infinite number of values within a range, and discrete if it can take only distinct, countable values (typically non-negative whole numbers).

  • If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
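In R, a variable's storage type often hints at its kind, although it does not decide it. The sketch below checks three mpg columns and then builds a hypothetical ordinal variable (shirt_size is invented for illustration and is not part of any dataset in these slides).

```r
library(ggplot2)  # for the mpg data

class(mpg$displ)  # engine displacement: numerical, continuous
class(mpg$cyl)    # number of cylinders: numerical, discrete
class(mpg$drv)    # drive type: categorical, no natural ordering

# Hypothetical ordinal variable: the levels have a natural ordering,
# which is declared explicitly with ordered = TRUE.
shirt_size <- factor(
  c("small", "large", "medium"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparisons are meaningful for ordered factors.
shirt_size[1] < shirt_size[2]
```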

Describing shapes of numerical distributions

  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • center: mean (mean), median (median), mode (not always useful)
  • spread: range (range), standard deviation (sd), inter-quartile range (IQR)
  • unusual observations
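The functions named in the list above can be applied directly to a numerical variable. As a sketch, the code below summarizes highway mpg; the use of boxplot.stats() to flag unusual observations is an assumption, reflecting one common rule of thumb rather than the only definition.

```r
library(ggplot2)  # for the mpg data

hwy <- mpg$hwy

# Center
mean(hwy)
median(hwy)

# Spread
range(hwy)  # minimum and maximum
sd(hwy)     # standard deviation
IQR(hwy)    # inter-quartile range

# Unusual observations: values beyond the boxplot whiskers
# (one rule of thumb among several).
unusual <- boxplot.stats(hwy)$out
unusual
```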

[Put these, and more, to use in HW 1]

Homework

Homework 1

  • Can collaborate with others, but must submit own work

  • Submission on GitHub (follow instructions on HW)

  • Due: By class on Thursday (Sep 11)
    • Will get a chance to work on it in class on Tuesday, but make sure to at least have read over it before then
    • Prof. C-R hours on Monday 3:30 - 5:30pm
    • Ask questions on Piazza