The %>% operator in dplyr functions is called the pipe operator. This means you “pipe” the output of the previous line of code as the first input of the next line of code.
The + operator in ggplot2 functions is used for “layering”. This means you create the plot in layers, separated by +.
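As a minimal sketch of what piping means, the snippet below writes the same summary twice: once as a nested call and once with %>%. It assumes the dplyr and ggplot2 packages are installed and uses the mpg data set that ships with ggplot2. The names nested, piped, and mean_hwy are made up for this illustration.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the mpg data set

# Nested form: has to be read from the inside out
nested <- summarise(group_by(mpg, class), mean_hwy = mean(hwy))

# Piped form: each step's output becomes the first argument of the next step
piped <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy))
```

Both objects hold the same table; the piped version simply reads top to bottom in the order the steps happen.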
Highlights from the “Tidy data” paper
ggplot2
## Source: local data frame [5 x 2]
##
## continent mean_lifeexp
## 1 Africa 54.80604
## 2 Americas 73.60812
## 3 Asia 70.72848
## 4 Europe 77.64860
## 5 Oceania 80.71950
## Source: local data frame [142 x 3]
##
## country continent lifeExp
## 1 Algeria Africa 72.301
## 2 Angola Africa 42.731
## 3 Benin Africa 56.728
## 4 Botswana Africa 50.728
## 5 Burkina Faso Africa 52.295
## 6 Burundi Africa 49.580
## 7 Cameroon Africa 50.430
## 8 Central African Republic Africa 44.741
## 9 Chad Africa 50.651
## 10 Comoros Africa 65.152
## .. ... ... ...
ggplot2
To use ggplot2 functions, first load the package
library(ggplot2)
In ggplot2, the structure of the code for plots can often be summarized as
ggplot +
  geom_xxx
or, more precisely
ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options
data: mpg
- fuel economy data from 1999 and 2008 for 38 popular models of cars
# see the help file
?mpg
# view data
View(mpg)
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()
Can display additional variables with
aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)
Visual characteristics of plotting characters that can be mapped to data are
color
size
shape
alpha (transparency)
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
aesthetics | discrete | continuous |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between radius and value |
shape | different shape for each | shouldn’t (and doesn’t) work |
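The table above can be tried out directly. The sketch below assumes ggplot2 is installed and maps a discrete variable (class), a continuous variable (cty), and another discrete variable (drv) from mpg to aesthetics. The object names p_discrete, p_continuous, and p_shape are made up for this illustration.

```r
library(ggplot2)

# Discrete variable mapped to color: one distinct color per class
p_discrete <- ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Continuous variable mapped to color: a gradient over city mileage
p_continuous <- ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Discrete variable mapped to shape: one plotting symbol per drive type
p_shape <- ggplot(data = mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()
```

Type the name of any of these objects at the console to draw it. Swapping shape = drv for shape = cty should fail when the plot is drawn, which matches the last row of the table.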
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data
ggplot(data = mpg, aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl) +
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
facet_grid(drv ~ .) +
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl) +
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
facet_wrap(~ class) +
geom_point()
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
geom_smooth
To plot a smooth curve, use geom_smooth()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_smooth()
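Because plots are built in layers, the smooth curve can be drawn on top of the points rather than instead of them. A minimal sketch, assuming ggplot2 is installed; the object name p_layers is made up for this illustration.

```r
library(ggplot2)

# Two layers on the same axes: raw points first, then the smooth curve
p_layers <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```

Each geom added with + becomes one layer, drawn in the order it was added.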
Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on others
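One way to picture the three kinds of analysis with the mpg data is sketched below. It assumes ggplot2 is installed. The choice of variables, the binwidth of 2, and the object names are made up for this illustration.

```r
library(ggplot2)

# Univariate: distribution of a single variable (highway mileage)
p_uni <- ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Bivariate: relationship between two variables
p_bi <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Multivariate: the same two variables, conditioning on a third (drive type)
p_multi <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv) +
  geom_point()
```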
Numerical variables can be classified as continuous or discrete: a continuous variable can take any of an infinite number of values within a range, while a discrete variable takes only countable values, such as non-negative whole numbers.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
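These distinctions show up in how R stores a variable. The sketch below assumes ggplot2 is installed (for mpg). The car_size factor and its levels are hypothetical values invented for this illustration, not part of mpg.

```r
library(ggplot2)  # loaded here only for the mpg data set

# Numerical, continuous: engine displacement is stored as a double
is.double(mpg$displ)

# Numerical, discrete: number of cylinders is stored as an integer
is.integer(mpg$cyl)

# Categorical, not ordinal: vehicle class is stored as plain text
is.character(mpg$class)

# Categorical, ordinal: hypothetical size labels with a natural ordering
size_levels <- c("small", "medium", "large")
car_size <- factor(c("medium", "small", "large", "small"),
                   levels = size_levels, ordered = TRUE)
is.ordered(car_size)

# Comparisons are meaningful only because the levels are ordered
smaller_than_large <- car_size < "large"
```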
Measures of center: mean (mean), median (median), mode (not always useful)
Measures of spread: range (range), standard deviation (sd), inter-quartile range (IQR)
[Put these, and more, to use in HW 1]
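The summary statistics listed above can be computed with base R alone. In the sketch below, the vector x holds made-up values chosen only for illustration; the same calls work on a column such as mpg$hwy. Base R has no built-in function for the statistical mode, so the table() approach shown here is one possible workaround, not a standard function.

```r
# Hypothetical values, invented for this example
x <- c(2, 4, 4, 5, 7, 9)

# Measures of center
x_mean   <- mean(x)
x_median <- median(x)
# Mode: most frequent value (ties resolve to the first one found)
x_mode   <- as.numeric(names(which.max(table(x))))

# Measures of spread
x_range  <- range(x)        # returns the minimum and maximum, not their difference
x_width  <- diff(range(x))  # maximum minus minimum
x_sd     <- sd(x)
x_iqr    <- IQR(x)          # third quartile minus first quartile
```

Note that range() returns two numbers, so diff(range(x)) is needed when a single value for the spread is wanted.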
Can collaborate with others, but must submit own work
Submission on GitHub (follow instructions on HW)