Data types and Tidy data


Today’s agenda

  • Highlights from the “Tidy data” paper

  • Visualizing data with ggplot2
    • Goal: Learn syntax and various visualizations for different data types

Tidy data

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.


Example

Which of the following data sets is tidy and which is not? Explain.


## # A tibble: 142 x 3
##                     country continent lifeExp
##                      <fctr>    <fctr>   <dbl>
## 1                   Algeria    Africa  72.301
## 2                    Angola    Africa  42.731
## 3                     Benin    Africa  56.728
## 4                  Botswana    Africa  50.728
## 5              Burkina Faso    Africa  52.295
## 6                   Burundi    Africa  49.580
## 7                  Cameroon    Africa  50.430
## 8  Central African Republic    Africa  44.741
## 9                      Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## # ... with 132 more rows

## # A tibble: 5 x 2
##   continent mean_lifeexp
##      <fctr>        <dbl>
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950





Messy data

Why is the following data not considered tidy?

##   id       trt work.T1 home.T1 work.T2 home.T2
## 1  1 treatment   0.085   0.616   0.114   0.052
## 2  2   control   0.225   0.430   0.596   0.264
## 3  3 treatment   0.275   0.652   0.358   0.399
## 4  4   control   0.272   0.568   0.429   0.836

Tidied data

##    id       trt location time value
## 1   1 treatment     home   T1 0.616
## 2   1 treatment     work   T1 0.085
## 3   1 treatment     home   T2 0.052
## 4   1 treatment     work   T2 0.114
## 5   2   control     home   T1 0.430
## 6   2   control     work   T1 0.225
## 7   2   control     home   T2 0.264
## 8   2   control     work   T2 0.596
## 9   3 treatment     home   T1 0.652
## 10  3 treatment     work   T1 0.275
## 11  3 treatment     home   T2 0.399
## 12  3 treatment     work   T2 0.358
## 13  4   control     home   T1 0.568
## 14  4   control     work   T1 0.272
## 15  4   control     home   T2 0.836
## 16  4   control     work   T2 0.429
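
The slides do not show how the tidied table was produced. As a minimal sketch, assuming the tidyr package is installed, the reshaping can be done with pivot_longer(); the object names messy and tidy are made up for this illustration, and the row order differs slightly from the output above.

```r
library(tidyr)

# re-create the messy table shown above
messy <- data.frame(
  id      = 1:4,
  trt     = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.085, 0.225, 0.275, 0.272),
  home.T1 = c(0.616, 0.430, 0.652, 0.568),
  work.T2 = c(0.114, 0.596, 0.358, 0.429),
  home.T2 = c(0.052, 0.264, 0.399, 0.836)
)

# each column name such as "work.T1" holds two variables (location and time);
# split the names at the dot and stack the measurements into one value column
tidy <- pivot_longer(
  messy,
  cols      = -c(id, trt),
  names_to  = c("location", "time"),
  names_sep = "\\.",
  values_to = "value"
)

tidy
```

After this step, every variable (id, trt, location, time, value) has its own column and every measurement has its own row.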

A taxonomy of variables

  • Numerical (quantitative) - takes on numerical values
    • Ask yourself - is it sensible to add, subtract, or calculate an average of these values?
  • Categorical (qualitative) - takes on one of a set of distinct categories
    • Ask yourself - are there only certain values (or categories) possible? Even if the categories can be identified with numbers, check if it would be sensible to do arithmetic operations with these values.

Numerical data

  • Continuous - data that is measured, any numerical (decimal) value

  • Discrete - data that is counted, only whole non-negative numbers

Categorical data

  • Ordinal - data where the categories have a natural order

  • Regular categorical - categories do not have a natural order
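
In R, categorical variables are usually stored as factors, and the ordinal versus regular distinction corresponds to ordered versus unordered factors. A small sketch in base R follows; the shirt sizes and eye colors are invented values, not data from the course.

```r
# ordinal: the levels have a natural order, so declare it explicitly
shirt <- factor(
  c("small", "large", "medium", "small"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)

# regular categorical: no natural order, so a plain factor is enough
eye_color <- factor(c("brown", "blue", "green", "brown"))

is.ordered(shirt)      # is the order part of the data type?
is.ordered(eye_color)

# comparisons are only meaningful for the ordered factor
shirt < "large"
```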

Example - Class Survey

Students in an introductory statistics course were asked the following questions as part of a class survey:

  1. Are you introverted or extraverted?
  2. On average, how much sleep do you get per night?
  3. When do you go to bed: 8pm-10pm, 10pm-12am, 12am-2am, later than 2am?
  4. How many countries have you visited?
  5. On a scale of 1 (very little) - 5 (a lot), how much math anxiety do you have?

What type of data is each variable described above?

Data visualization

ggplot2

To use ggplot2 functions, first load the package

library(ggplot2)

In ggplot2 the structure of the code for plots will take the form

ggplot(arguments) + 
  geom_xxxxx(additional_arguments)

or, more precisely

ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable], [opt. arguments])) +
   geom_xxx() +
   other options

Geoms, short for geometric objects, describe the type of plot you will produce using the grammar of graphics.

Context: Fuel economy

mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

Run the following in the Console:

# see the help file
?mpg
# view data
View(mpg)

How many rows and columns does this dataset have?
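
If View() is not convenient, the same question can be answered from the Console. This is one possible approach using the base R functions dim() and str(); the name mpg_dims is only for illustration.

```r
library(ggplot2)  # the mpg data set ships with ggplot2

# number of rows and number of columns, in that order
mpg_dims <- dim(mpg)
mpg_dims

# one line per column, with its type and first few values
str(mpg)
```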

Is the data tidy?

What relationship do you expect to see between engine size (displ) and mileage (hwy)?

Scatterplots

Displacement vs. highway mpg (geom_point())

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

How would you describe this relationship?

Additional variables

What other variables would help us understand data points that don’t follow the overall trend?


We can display additional variables using

  • aesthetics (like shape, colour, size), or

  • faceting (small multiples displaying different subsets of the data)

Aesthetics

Aesthetics options

Aesthetics are visual characteristics of the plot that can be mapped to the data. Some common examples:

  • color

  • size

  • shape

  • alpha (transparency)
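
An aesthetic can either be mapped to a variable (inside aes()) or set to one fixed value (outside aes()). The sketch below illustrates the difference on the mpg data; the alpha value of 0.5 and the name p_aes are arbitrary choices for the example.

```r
library(ggplot2)

# color is mapped to a variable inside aes(), so it varies with the data;
# alpha is set to a constant outside aes(), so it applies to every point
p_aes <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(alpha = 0.5)

p_aes
```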

Displacement vs. highway mpg + class

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Displacement vs. highway mpg + cyl

ggplot(data = mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

Displacement vs. highway mpg + cyl (as a factor)

ggplot(data = mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

Aesthetics summary

aesthetics   categorical                         numeric
color        rainbow of colors                   gradient
size         discrete steps                      continuous mapping between point area and value
shape        different shape for each category   shouldn’t (and doesn’t) work
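
To try the shape and size rows of the summary, here is a short sketch on the mpg data. The choice of drv (a categorical variable with three levels) and cty (a numeric variable) is for illustration only.

```r
library(ggplot2)

# shape needs a categorical variable: drv has the levels 4, f and r
p_shape <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()

# size accepts a numeric variable: cty is city miles per gallon
p_size <- ggplot(data = mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

p_shape
p_size
```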

Faceting

Faceting options

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data

Displacement vs. highway mpg by cylinders

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl) +
  geom_point()

Dive further…

In the next few slides, describe what each plot displays. Think about how the code relates to the output.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class) +
  geom_point()  

Facet summary

  • facet_grid(): 2d grid, rows ~ cols, . for no split

  • facet_wrap(): 1d ribbon wrapped into 2d
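
Because facet_wrap() wraps a single sequence of panels, the number of columns in the layout can be chosen with its ncol argument. A minimal sketch, where ncol = 4 is an arbitrary choice:

```r
library(ggplot2)

# one panel per class, laid out in rows of at most four panels
p_wrap <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4) +
  geom_point()

p_wrap
```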

Other geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

geom_smooth

To plot a smooth curve, we can use geom_smooth()

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

geom_smooth + geom_point

We can combine plots by adding additional layers

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth() + 
  geom_point()
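
Layers are drawn in the order they are added, and each geom accepts its own arguments. As a sketch of a variation on the plot above, the points can be drawn first and a straight-line fit placed on top. The use of method = "lm" and se = FALSE is an assumption made for illustration, not something the slides prescribe.

```r
library(ggplot2)

p_layers <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +                      # drawn first, underneath
  geom_smooth(method = "lm",          # straight-line (linear model) fit
              formula = y ~ x,
              se = FALSE)             # hide the confidence band

p_layers
```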