Data types and Tidy data

Today's agenda

Highlights from the "Tidy data" paper
Visualizing data with ggplot2
- Goal: Learn syntax and various visualizations for different data types

Tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Example

Which of the following data sets is tidy and which is not? Explain.

## # A tibble: 142 x 3
##                     country continent lifeExp
##                      <fctr>    <fctr>   <dbl>
## 1                   Algeria    Africa  72.301
## 2                    Angola    Africa  42.731
## 3                     Benin    Africa  56.728
## 4                  Botswana    Africa  50.728
## 5              Burkina Faso    Africa  52.295
## 6                   Burundi    Africa  49.580
## 7                  Cameroon    Africa  50.430
## 8  Central African Republic    Africa  44.741
## 9                      Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## # ... with 132 more rows

## # A tibble: 5 x 2
##   continent mean_lifeexp
##      <fctr>        <dbl>
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950

Messy data

Why is the following data not considered tidy?

##   id       trt work.T1 home.T1 work.T2 home.T2
## 1  1 treatment   0.085   0.616   0.114   0.052
## 2  2   control   0.225   0.430   0.596   0.264
## 3  3 treatment   0.275   0.652   0.358   0.399
## 4  4   control   0.272   0.568   0.429   0.836

Tidied data

##    id       trt location time value
## 1   1 treatment     home   T1 0.616
## 2   1 treatment     work   T1 0.085
## 3   1 treatment     home   T2 0.052
## 4   1 treatment     work   T2 0.114
## 5   2   control     home   T1 0.430
## 6   2   control     work   T1 0.225
## 7   2   control     home   T2 0.264
## 8   2   control     work   T2 0.596
## 9   3 treatment     home   T1 0.652
## 10  3 treatment     work   T1 0.275
## 11  3 treatment     home   T2 0.399
## 12  3 treatment     work   T2 0.358
## 13  4   control     home   T1 0.568
## 14  4   control     work   T1 0.272
## 15  4   control     home   T2 0.836
## 16  4   control     work   T2 0.429
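
One way to get from the messy table to the tidied one is to reshape it so that the information stored in the column names becomes data. The sketch below is not from the original slides: it assumes the tidyr package and its pivot_longer() function, and the data frame name messy is made up for illustration. The values are copied from the messy table above.

```r
library(tidyr)

# Recreate the messy table; `messy` is a hypothetical name
messy <- data.frame(
  id      = 1:4,
  trt     = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.085, 0.225, 0.275, 0.272),
  home.T1 = c(0.616, 0.430, 0.652, 0.568),
  work.T2 = c(0.114, 0.596, 0.358, 0.429),
  home.T2 = c(0.052, 0.264, 0.399, 0.836)
)

# Split each measurement column name at the "." into location and time,
# and stack the measurements into a single value column
tidied <- pivot_longer(
  messy,
  cols      = -c(id, trt),
  names_to  = c("location", "time"),
  names_sep = "\\.",
  values_to = "value"
)

tidied
```

The rows may come out in a different order than on the slide, but each row is still one measurement for one id, location, and time.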

A taxonomy of variables

Numerical (quantitative) - takes on numerical values
- Ask yourself - is it sensible to add, subtract, or calculate an average of these values?
Categorical (qualitative) - takes on one of a set of distinct categories
- Ask yourself - are there only certain values (or categories) possible? Even if the categories can be identified with numbers, check if it would be sensible to do arithmetic operations with these values.

Numerical data

Continuous - data that is measured, any numerical (decimal) value
Discrete - data that is counted, only whole non-negative numbers

Categorical data

Ordinal - data where the categories have a natural order
Regular categorical - data where the categories do not have a natural order
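
In R, this taxonomy shows up in how a variable is stored. The sketch below uses made-up values and hypothetical variable names (none of them come from the class survey): numeric vectors for numerical data, factor() for regular categorical data, and an ordered factor for ordinal data.

```r
# Numerical: arithmetic is meaningful
height_cm  <- c(170.2, 158.9, 181.5)  # continuous (measured)
n_siblings <- c(0L, 2L, 1L)           # discrete (counted)
mean(height_cm)

# Regular categorical: stored as a factor; the levels have no order
eye_color <- factor(c("brown", "blue", "brown"))
levels(eye_color)

# Ordinal: an ordered factor; levels are listed from lowest to highest
shirt_size <- factor(
  c("small", "large", "medium"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)
shirt_size < "large"  # comparisons respect the level order

# Numbers used as labels are still categorical:
# averaging zip codes would not be sensible
zip_code <- factor(c(27708, 27514, 27708))
```

Storing number-like labels as a factor is one way to keep them from being averaged by accident.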

Example - Class Survey

Students in an introductory statistics course were asked the following questions as part of a class survey:

Are you introverted or extraverted?
On average, how much sleep do you get per night?
When do you go to bed: 8pm-10pm, 10pm-12am, 12am-2am, later than 2am?
How many countries have you visited?
On a scale of 1 (very little) - 5 (a lot), how much math anxiety do you have?

What type of data is each variable described above?

Data visualization

`ggplot2`

To use ggplot2 functions, first load the package

library(ggplot2)

In ggplot2 the structure of the code for plots will take the form

ggplot(arguments) + 
  geom_xxxxx(additional_arguments)

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable], [opt. arguments])) +
   geom_xxx() +
   other options

Geoms, short for geometric objects, describe the type of plot you will produce using the grammar of graphics.

Context: Fuel economy

mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

Run the following in the Console:

# see the help file
?mpg

# view data
View(mpg)

How many rows and columns does this dataset have?
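
One way to check this from the Console, rather than scrolling through the viewer, is sketched below. It assumes ggplot2 is installed, since mpg ships with that package.

```r
library(ggplot2)

# Number of rows and columns, in that order
dim(mpg)

# Column names with their types and the first few values
str(mpg)
```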

Is the data tidy?

What relationship do you expect to see between engine size (displ) and mileage (hwy)?

Scatterplots

Displacement vs. highway mpg (`geom_point()`)

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

How would you describe this relationship?

Additional variables

What other variables would help us understand data points that don't follow the overall trend?

We can display additional variables using

aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets of the data)

Aesthetics

Aesthetics options

Aesthetics are visual characteristics of the plot that can be mapped to the data. Some common examples:

color
size
shape
alpha (transparency)
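
As a rough illustration of two aesthetics from this list, the sketch below maps shape to drive type (drv, a categorical column of mpg) and sets alpha to a fixed value so that overlapping points are easier to see. The choice of drv and of 0.5 is arbitrary, made only for this example.

```r
library(ggplot2)

# shape is mapped to a variable inside aes();
# alpha is set to a constant outside aes(), so it applies to every point
p <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point(alpha = 0.5)

p
```

An aesthetic placed inside aes() varies with the data; the same aesthetic given outside aes() is a fixed setting for the whole layer.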

Displacement vs. highway mpg + class

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Displacement vs. highway mpg + cyl

ggplot(data = mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

Displacement vs. highway mpg + cyl (as a factor)

ggplot(data = mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()
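
The two cyl plots differ because of how cyl is stored. A quick check, assuming ggplot2 is loaded so that mpg is available:

```r
library(ggplot2)

# cyl is stored as a whole number, so ggplot2 treats it as numeric
# and colors the points along a gradient
class(mpg$cyl)

# Only a handful of distinct values occur
sort(unique(mpg$cyl))

# factor() turns each distinct value into its own category,
# so each one gets a separate color
levels(factor(mpg$cyl))
```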

Aesthetics summary

aesthetics	categorical	numeric
color	rainbow of colors	gradient
size	discrete steps	linear mapping between area and value
shape	different shape for each category	shouldn't (and doesn't) work

Faceting

Faceting options

Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

Displacement vs. highway mpg by cylinders

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl) +
  geom_point()

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class) +
  geom_point()

Facet summary

facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
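
The two functions can also be written as in the sketch below. This is not from the original slides: the rows/cols form of facet_grid() with vars() assumes ggplot2 3.0.0 or later, and ncol = 4 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Same grid as facet_grid(drv ~ cyl), with rows and columns spelled out
p_grid <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl)) +
  geom_point()

# facet_wrap() with ncol controlling where the ribbon wraps
p_wrap <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4) +
  geom_point()

p_grid
p_wrap
```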

Other geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

geom_smooth

To plot a smooth curve, we can use geom_smooth()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

geom_smooth + geom_point

We can combine plots by adding additional layers

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth() + 
  geom_point()
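
Layers are drawn in the order they are added, so swapping the two geoms changes which one sits on top. The sketch below puts the points first so that the curve is drawn over them. Naming method = "loess" and turning off the confidence band with se = FALSE are choices made for this illustration, not settings from the original slides.

```r
library(ggplot2)

# Points first, curve second, so the curve is not hidden by the points
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)

p
```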
