From last time…

A note on piping and layering

The %>% operator used with dplyr functions is called the pipe operator. It “pipes” the output of the previous step in as the first argument of the next function.
The + operator in ggplot2 is used for “layering”: you build the plot in layers, separated by +.
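
A minimal sketch contrasting the two operators, assuming the dplyr and ggplot2 packages are installed and using the mpg data that ships with ggplot2. The filter on four-cylinder cars and the name four_cyl are illustrative choices, not part of the original slides.

```r
library(dplyr)
library(ggplot2)

# Piping: the result of each step becomes the first argument of the next.
four_cyl <- mpg %>%
  filter(cyl == 4) %>%     # same as filter(mpg, cyl == 4)
  select(displ, hwy)       # same as select(<filtered data>, displ, hwy)

# Layering: each + adds another layer or option to the same plot.
p <- ggplot(data = four_cyl, aes(x = displ, y = hwy)) +
  geom_point()
p
```

Note that the pipe passes data between functions, while + only combines pieces of a single plot; the two are not interchangeable.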

Today’s agenda

Highlights from the “Tidy data” paper
Visualizing data with ggplot2
- Goal: Learn syntax and various visualizations for different data types

Tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Which of the following is a data set and which is a summary table?

## Source: local data frame [5 x 2]
## 
##   continent mean_lifeexp
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950

## Source: local data frame [142 x 3]
## 
##                     country continent lifeExp
## 1                   Algeria    Africa  72.301
## 2                    Angola    Africa  42.731
## 3                     Benin    Africa  56.728
## 4                  Botswana    Africa  50.728
## 5              Burkina Faso    Africa  52.295
## 6                   Burundi    Africa  49.580
## 7                  Cameroon    Africa  50.430
## 8  Central African Republic    Africa  44.741
## 9                      Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## ..                      ...       ...     ...
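
To see how a table like the first one can be derived from a table like the second, here is a minimal sketch, assuming dplyr is installed. The data frame life is a hypothetical stand-in built from only the first three rows printed above, so its mean will not match the value in the full table.

```r
library(dplyr)

# Hypothetical stand-in: the first three rows of the 142-row table above.
life <- data.frame(
  country   = c("Algeria", "Angola", "Benin"),
  continent = c("Africa", "Africa", "Africa"),
  lifeExp   = c(72.301, 42.731, 56.728)
)

# One row per country collapses to one row per continent.
life_summary <- life %>%
  group_by(continent) %>%
  summarise(mean_lifeexp = mean(lifeExp))

life_summary
```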

Data visualization

`ggplot2`

To use ggplot2 functions, first load the package

library(ggplot2)

In ggplot2 the structure of the code for plots can often be summarized as

ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options

Geoms, short for geometric objects, describe the type of plot you will produce

Context: Fuel economy

data: mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

Run the following in the Console:

# see the help file
?mpg

# view data
View(mpg)

How many rows and columns does this dataset have? What does each row represent? What does each column represent?
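
One way to check your answers in the Console, sketched with base R functions and assuming ggplot2 is loaded so that mpg is available:

```r
library(ggplot2)   # mpg ships with ggplot2

dim(mpg)     # number of rows, then number of columns
names(mpg)   # column (variable) names
str(mpg)     # type and first few values of each column
head(mpg)    # first six rows
```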

Make a prediction: What relationship do you expect to see between engine size (displ) and mileage (hwy)?

Scatterplots

Displacement vs. highway mpg (`geom_point()`)

How would you describe this relationship? What other variables would help us understand data points that don’t follow the overall trend?

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

Additional variables

Can display additional variables with

aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

color
size
shape
alpha (transparency)

Displacement vs. highway mpg + class

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Aesthetics summary

aesthetic	discrete	continuous
color	rainbow of colors	gradient
size	discrete steps	area scales with value by default (ggplot2 versions before 2.0 scaled radius)
shape	different shape for each	shouldn’t (and doesn’t) work
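
A minimal sketch of the color row of this table, assuming ggplot2 is installed. The choice of cty (city mpg) as the continuous variable is an illustrative assumption; any numeric column would behave the same way.

```r
library(ggplot2)

# Discrete variable mapped to color: one distinct hue per class.
p_discrete <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Continuous variable mapped to color: a gradient over city mpg (cty).
p_continuous <- ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

p_discrete
p_continuous
```

Only the variable inside aes() changes between the two plots; ggplot2 picks the type of color scale from the type of that variable.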

Faceting

Faceting options

Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

Displacement vs. highway mpg by cylinders

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl) +
  geom_point()

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class) +
  geom_point()

Facet summary

facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
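
To make the "ribbon wrapped into 2d" idea concrete, here is a small sketch, assuming ggplot2 is installed. The value ncol = 2 is an arbitrary illustrative choice.

```r
library(ggplot2)

# facet_wrap() lays the class panels out in sequence and wraps them;
# ncol fixes how many panels sit in each row before wrapping.
p_wrap <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 2) +
  geom_point()
p_wrap
```

Changing ncol rearranges the same panels without changing what each panel shows, which is the key difference from facet_grid(), where rows and columns each carry a variable.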

Other geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

`geom_smooth`

To plot a smooth curve, use geom_smooth()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth()
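
Because plots are built in layers, points and a smooth curve can share the same axes. The sketch below assumes ggplot2 is installed; the straight-line variant with method = "lm" is an added illustration, not something shown on the original slides.

```r
library(ggplot2)

# Two layers on the same axes: raw points plus a smooth trend.
p_both <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()   # ggplot2 picks the smoothing method (loess for small data)

# A straight-line fit instead of the default smoother.
p_line <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

p_both
p_line
```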

Exploratory data analysis (EDA)

Number of variables involved

Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

Types of variables

Numerical variables can be classified as continuous or discrete, depending on whether the variable can take on any of an infinite number of values within a range or only non-negative whole numbers, respectively.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
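
A short sketch of how these types show up in R, assuming ggplot2 is loaded for the mpg data. The size factor is a hypothetical example of an ordinal variable and is not part of mpg.

```r
library(ggplot2)   # for mpg

class(mpg$displ)   # numeric: engine displacement, continuous
class(mpg$cyl)     # integer: number of cylinders, discrete
class(mpg$class)   # character: type of car, categorical

# Hypothetical ordinal variable: levels with a natural ordering.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"     # comparisons are meaningful for ordered factors
```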

Describing shapes of numerical distributions

shape:
- skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- modality: unimodal, bimodal, multimodal, uniform
center: mean (mean), median (median), mode (not always useful)
spread: range (range), standard deviation (sd), inter-quartile range (IQR)
unusual observations
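
The functions named above can be tried on any numerical variable. A minimal sketch, assuming ggplot2 is loaded and using highway mileage (hwy) from mpg as an illustrative choice:

```r
library(ggplot2)   # for mpg

hwy <- mpg$hwy

# Center
mean(hwy)
median(hwy)

# Spread
range(hwy)         # returns the minimum and the maximum
diff(range(hwy))   # the range as a single number
sd(hwy)
IQR(hwy)
```

Note that range() returns two values rather than their difference, which is why diff() is applied to it here.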

[Put these, and more, to use in HW 1]

Homework

Homework 1

Can collaborate with others, but must submit own work
Submission on GitHub (follow instructions on HW)
Due: By class on Thursday (Sep 11)
- Will get a chance to work on it in class on Tuesday, but make sure to at least have read over it before then
- Prof. C-R office hours on Monday 3:30 - 5:30pm
- Ask questions on Piazza

Sta112FS
4. Tidy data & Intro to data visualization

Dr. Çetinkaya-Rundel

September 3, 2015
