Highlights from the "Tidy data" paper
ggplot2
Which of the following data sets is tidy and which is not? Explain.
## # A tibble: 142 x 3
##                     country continent lifeExp
##                      <fctr>    <fctr>   <dbl>
##  1                  Algeria    Africa  72.301
##  2                   Angola    Africa  42.731
##  3                    Benin    Africa  56.728
##  4                 Botswana    Africa  50.728
##  5             Burkina Faso    Africa  52.295
##  6                  Burundi    Africa  49.580
##  7                 Cameroon    Africa  50.430
##  8 Central African Republic    Africa  44.741
##  9                     Chad    Africa  50.651
## 10                  Comoros    Africa  65.152
## # ... with 132 more rows
## # A tibble: 5 x 2
##   continent mean_lifeexp
##      <fctr>        <dbl>
## 1    Africa     54.80604
## 2  Americas     73.60812
## 3      Asia     70.72848
## 4    Europe     77.64860
## 5   Oceania     80.71950
Why is the following data not considered tidy?
##   id       trt work.T1 home.T1 work.T2 home.T2
## 1  1 treatment   0.085   0.616   0.114   0.052
## 2  2   control   0.225   0.430   0.596   0.264
## 3  3 treatment   0.275   0.652   0.358   0.399
## 4  4   control   0.272   0.568   0.429   0.836
##    id       trt location time value
## 1   1 treatment     home   T1 0.616
## 2   1 treatment     work   T1 0.085
## 3   1 treatment     home   T2 0.052
## 4   1 treatment     work   T2 0.114
## 5   2   control     home   T1 0.430
## 6   2   control     work   T1 0.225
## 7   2   control     home   T2 0.264
## 8   2   control     work   T2 0.596
## 9   3 treatment     home   T1 0.652
## 10  3 treatment     work   T1 0.275
## 11  3 treatment     home   T2 0.399
## 12  3 treatment     work   T2 0.358
## 13  4   control     home   T1 0.568
## 14  4   control     work   T1 0.272
## 15  4   control     home   T2 0.836
## 16  4   control     work   T2 0.429
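One way to get from the wide table to the long one is sketched below. It assumes the tidyr package (version 1.0.0 or later) is installed; the slides do not say which tool produced the long table, so this is an illustration rather than the original method. The object names wide and long are made up for the example, and the row order of the result may differ from the output shown above.

```r
library(tidyr)

# Wide data copied from the untidy example above
wide <- data.frame(
  id      = 1:4,
  trt     = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.085, 0.225, 0.275, 0.272),
  home.T1 = c(0.616, 0.430, 0.652, 0.568),
  work.T2 = c(0.114, 0.596, 0.358, 0.429),
  home.T2 = c(0.052, 0.264, 0.399, 0.836)
)

# Split each column name at the "." into a location and a time,
# and stack the four measurement columns into a single value column
long <- pivot_longer(
  wide,
  cols      = -c(id, trt),
  names_to  = c("location", "time"),
  names_sep = "\\.",
  values_to = "value"
)

long
```

Each row of the result holds one measurement for one person at one location and time, which is what makes the long version tidy.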
Continuous - data that is measured, any numerical (decimal) value
Discrete - data that is counted, only whole non-negative numbers
Ordinal - data where the categories have a natural order
Regular categorical - categories do not have a natural order
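As a rough guide, the four types above can be stored in R as follows. The variable names and values are hypothetical, made up only to illustrate the storage types, and are not taken from the class survey.

```r
# Hypothetical responses, invented for illustration only
height_cm  <- c(162.5, 170.2, 181.0)                 # continuous: measured
n_siblings <- c(0L, 2L, 1L)                          # discrete: counted
year       <- factor(c("first", "third", "second"),
                     levels  = c("first", "second", "third"),
                     ordered = TRUE)                 # ordinal: ordered categories
eye_color  <- factor(c("brown", "blue", "brown"))    # categorical: no natural order

# Check how R stores each one
is.double(height_cm)
is.integer(n_siblings)
is.ordered(year)
is.ordered(eye_color)
```

The distinction matters later on, because ggplot2 picks different scales for numeric columns and for factors.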
Students in an introductory statistics course were asked the following questions as part of a class survey:
What type of data is each variable described above?
ggplot2
To use ggplot2 functions, first load the package
library(ggplot2)
In ggplot2, the structure of the code for plots will take the form
ggplot(arguments) + geom_xxx(additional_arguments)
or, more precisely
ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable], [opt. arguments])) + geom_xxx() + other options
Geoms, short for geometric objects, describe the type of plot you will produce using the grammar of graphics.
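A minimal sketch of the template filled in. It uses the cars data set that ships with base R (columns speed and dist), which is my choice for illustration and not part of the slides; the object names p and p_points are also made up.

```r
library(ggplot2)

# Data and aesthetic mappings only: no geom yet, so no layer is drawn
p <- ggplot(data = cars, aes(x = speed, y = dist))

# Adding a geom adds a layer; geom_point() draws one point per row
p_points <- p + geom_point()

# layer_data() returns the data ggplot2 computed for that layer
nrow(layer_data(p_points))
nrow(cars)
```

The two row counts should match, since a scatterplot layer keeps one row per observation.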
mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars
Run the following in the Console:
# see the help file
?mpg
# view data
View(mpg)
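The help file and the viewer are interactive tools. The same data set can also be inspected with a few base R functions, sketched here assuming ggplot2 is installed, since mpg ships with it.

```r
library(ggplot2)  # mpg is included in ggplot2

dim(mpg)    # number of rows, then number of columns
names(mpg)  # column names
str(mpg)    # type and first few values of each column
```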
How many rows and columns does this dataset have?
Is the data tidy?
What relationship do you expect to see between engine size (displ) and mileage (hwy)?
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point()
How would you describe this relationship?
What other variables would help us understand data points that don't follow the overall trend?
We can display additional variables using
aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets of the data)
Aesthetics are visual characteristics of the plot that can be mapped to the data. Some common examples:
color
size
shape
alpha (transparency)
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy, color = cyl)) + geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy, color = factor(cyl))) + geom_point()
aesthetics | categorical | numeric |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between point area and value |
shape | different shape for each category | shouldn't (and doesn't) work |
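The last row of the table can be checked directly. In the sketch below, cyl is stored as a number in mpg, so mapping it to shape should fail when the plot is built, while factor(cyl) should work. The object names are hypothetical, and tryCatch() is used only so that the error is captured instead of stopping the script.

```r
library(ggplot2)

# Numeric variable mapped to shape: expected to fail at build time
p_numeric <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = cyl)) +
  geom_point()
result_numeric <- tryCatch(ggplot_build(p_numeric), error = function(e) e)
inherits(result_numeric, "error")

# Same variable converted to a factor: one shape per category
p_factor <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = factor(cyl))) +
  geom_point()
result_factor <- tryCatch(ggplot_build(p_factor), error = function(e) e)
inherits(result_factor, "error")
```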
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data
ggplot(data = mpg, aes(x = displ, y = hwy)) + facet_grid(. ~ cyl) + geom_point()
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
ggplot(data = mpg, aes(x = displ, y = hwy)) + facet_grid(drv ~ .) + geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl) + geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) + facet_wrap(~ class) + geom_point()
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
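To connect the facet formulas to the resulting layouts, the sketch below counts the panels that contain points for three of the plots above. The helper count_panels() is hypothetical and written only for this example; it relies on the PANEL column that layer_data() returns.

```r
library(ggplot2)

base <- ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point()

# Hypothetical helper: number of panels that actually contain points
count_panels <- function(p) length(unique(layer_data(p)$PANEL))

count_panels(base + facet_grid(. ~ cyl))  # one column per value of cyl
count_panels(base + facet_grid(drv ~ .))  # one row per value of drv
count_panels(base + facet_wrap(~ class))  # one panel per class, wrapped into rows
```

Note that facet_grid(drv ~ cyl) also draws panels for combinations with no cars in them, so a count based on the plotted points can be smaller than the number of panels on screen.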
How are these plots similar? How are they different?
To plot a smooth curve, we can use geom_smooth()
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_smooth()
We can combine plots by adding additional layers
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_smooth() + geom_point()
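Layers can also carry their own aesthetics and settings. The sketch below is one possible variation, not taken from the slides: the points are coloured by class, and the smoother is switched to a straight-line fit with method = "lm" and no confidence band. Layers are drawn in the order they are added, so here the line sits on top of the points.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  # colour is mapped inside this layer, so it applies to the points only
  geom_point(aes(color = class), alpha = 0.6) +
  # a single linear fit across all cars, drawn without the shaded band
  geom_smooth(method = "lm", se = FALSE)

p
```

Moving color = class into the ggplot() call instead would apply it to every layer, giving one fitted line per class.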