September 5, 2017
Any questions on material from last time?
Any questions on homework?
Any questions on workflow / course structure?
Functions in R are often verbs; the arguments for a function go in the parentheses that follow its name.
verb(what-you-want-to-apply-verb-to, other-arguments)
For example:
glimpse(gapminder) # Glimpse into the gapminder dataset
gapminder %>%            # Pipe into the next function
  filter(year == 1952)   # Filter if year is equal to 1952
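To make the role of the pipe concrete, here is a minimal sketch showing that the piped form and the nested `verb(data, other-arguments)` form do the same thing. It uses the `starwars` dataset from dplyr (introduced later in these slides) instead of gapminder so that only one package is needed; the cutoff of 200 is an arbitrary value chosen for illustration.

```r
library(dplyr)

# Nested form: the data frame is the first argument of the verb.
tall_nested <- filter(starwars, height > 200)

# Piped form: %>% passes the left-hand side in as that first argument.
tall_piped <- starwars %>%
  filter(height > 200)

# Compare the two results.
identical(tall_nested, tall_piped)
```

The piped form tends to read better once several verbs are chained, since the steps appear in the order they are applied.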
"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey
Data visualization is the creation and study of the visual representation of data.
There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (ggplot2 is one of them, and that's the one we're going to use).
library(ggplot2)
ggplot + geom_xxx
or, more precisely
ggplot(data = [dataset], aes(x = [x-variable], y = [y-variable])) + geom_xxx() + other options
"gg" in "ggplot2" stands for Grammar of Graphics
ggplot() is the main function in ggplot2
The dplyr package contains a dataset called starwars:
library(dplyr)
starwars
## # A tibble: 87 x 13
##    name               height  mass hair_color    skin_color  eye_color
##    <chr>               <int> <dbl> <chr>         <chr>       <chr>
##  1 Luke Skywalker        172    77 blond         fair        blue
##  2 C-3PO                 167    75 <NA>          gold        yellow
##  3 R2-D2                  96    32 <NA>          white, blue red
##  4 Darth Vader           202   136 none          white       yellow
##  5 Leia Organa           150    49 brown         light       brown
##  6 Owen Lars             178   120 brown, grey   light       blue
##  7 Beru Whitesun lars    165    75 brown         light       blue
##  8 R5-D4                  97    32 <NA>          white, red  red
##  9 Biggs Darklighter     183    84 black         light       brown
## 10 Obi-Wan Kenobi        182    77 auburn, white fair        blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
What does each row represent? What does each column represent?
Take a glimpse at the data:
glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
Run the following in the Console to view the help page for the dataset:
?starwars
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
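The row and column counts can also be read off programmatically rather than from the printed output. A small sketch, using only base R functions on the `starwars` data frame (the counts you get may differ from the listing above if your installed dplyr ships a newer version of the dataset):

```r
library(dplyr)

n_rows <- nrow(starwars)   # number of rows (observations)
n_cols <- ncol(starwars)   # number of columns (variables)

dim(starwars)              # both at once: rows first, then columns
names(starwars)            # the column (variable) names
```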
Make a prediction: What relationship do you expect to see between height and mass?
geom_point()
ggplot(data = starwars, aes(x = height, y = mass)) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
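The warning says that some rows could not be drawn. A point needs both an x and a y value, so one plausible way to see which rows were dropped is to look for characters with a missing height or mass. This is a sketch of that check, not necessarily how ggplot2 counts internally:

```r
library(dplyr)

# Rows where at least one of the two plotted variables is missing.
missing_xy <- starwars %>%
  filter(is.na(height) | is.na(mass))

nrow(missing_xy)   # compare with the number reported in the warning
```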
How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?
Who is the not so tall but really chubby character?
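One way to answer this from the data rather than by squinting at the plot is to sort the characters by mass. A minimal sketch using dplyr verbs; showing three rows is an arbitrary choice:

```r
library(dplyr)

heaviest <- starwars %>%
  filter(!is.na(mass)) %>%         # drop characters with unknown mass
  arrange(desc(mass)) %>%          # heaviest first
  select(name, height, mass) %>%   # keep only the columns of interest
  head(3)

heaviest
```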
Can display additional variables with
aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)
Visual characteristics of plotting characters that can be mapped to data are
color
size
shape
alpha (transparency)
ggplot(data = starwars, aes(x = height, y = mass, color = gender)) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
aesthetics | discrete | continuous |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between radius and value |
shape | different shape for each | shouldn't (and doesn't) work |
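The earlier plot mapped color to a discrete variable (gender). As a sketch of the "continuous" column of the table, the plots below map a numeric variable instead; `birth_year` is chosen here only because it is a numeric column in `starwars`.

```r
library(dplyr)
library(ggplot2)

# birth_year is numeric, so color is drawn as a gradient.
p_color <- ggplot(data = starwars, aes(x = height, y = mass, color = birth_year)) +
  geom_point()

# The same variable mapped to size gives a continuous size scale.
p_size <- ggplot(data = starwars, aes(x = height, y = mass, size = birth_year)) +
  geom_point()

# Mapping birth_year to shape is left out: per the table, it does not work.
p_color
```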
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data
ggplot(data = starwars, aes(x = height, y = mass)) + facet_grid(. ~ gender) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
ggplot(data = starwars, aes(x = height, y = mass)) + facet_grid(gender ~ .) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
ggplot(data = starwars, aes(x = height, y = mass)) + facet_grid(. ~ gender) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
ggplot(data = starwars, aes(x = height, y = mass)) + facet_wrap(~ eye_color) + geom_point()
## Warning: Removed 28 rows containing missing values (geom_point).
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
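The facet_grid() examples so far split on one side of the `~` only. Here is a sketch of the full `rows ~ cols` form. The `is_human` column is a hypothetical helper created just for this illustration; it is not part of the original dataset.

```r
library(dplyr)
library(ggplot2)

# is_human is a made-up helper column, added only for this example.
starwars_flagged <- starwars %>%
  mutate(is_human = species == "Human")

# rows ~ cols: one row of panels per gender, one column per is_human value.
p_grid <- ggplot(data = starwars_flagged, aes(x = height, y = mass)) +
  facet_grid(gender ~ is_human) +
  geom_point()

p_grid
```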
How are these plots similar? How are they different?
geom_smooth
To plot a smooth curve, use geom_smooth()
ggplot(data = starwars, aes(x = height, y = mass)) + geom_smooth()
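Geoms can be layered, so the points and the smooth curve can share one plot. A minimal sketch; the second plot swaps the default smoother for a straight-line fit with `method = "lm"`, which is one option among several.

```r
library(dplyr)
library(ggplot2)

# Layers are added with +; both geoms use the same x and y mapping.
p_both <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth()

# method = "lm" fits a linear model instead of the default smoother.
p_line <- ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth(method = "lm")

p_both
```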
Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning for others
Numerical variables can be classified as continuous or discrete based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
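In R, an ordinal variable can be represented as an ordered factor. A small sketch with made-up survey responses (the values and level names are hypothetical, chosen only to illustrate a natural ordering):

```r
# Hypothetical responses, invented for this illustration.
responses <- c("medium", "low", "high", "medium", "low")

# ordered = TRUE records that low < medium < high.
sizes <- factor(responses, levels = c("low", "medium", "high"), ordered = TRUE)

is.ordered(sizes)     # checks whether the factor carries an ordering
sizes[1] < sizes[3]   # comparisons follow the level order, not the alphabet
```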
Center: mean (mean), median (median), mode (not always useful)
Spread: range (range), standard deviation (sd), inter-quartile range (IQR)
For numerical variables
ggplot(starwars, aes(x = height)) + geom_histogram(binwidth = 10)
## Warning: Removed 6 rows containing non-finite values (stat_bin).
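Alongside the histogram, the numerical summaries named earlier (mean, median, range, sd, IQR) can be computed directly. A sketch for height; `na.rm = TRUE` is needed because, as the warning indicates, some heights are missing.

```r
library(dplyr)

height <- starwars$height

# Center
height_mean   <- mean(height, na.rm = TRUE)
height_median <- median(height, na.rm = TRUE)

# Spread
height_range <- range(height, na.rm = TRUE)   # smallest and largest value
height_sd    <- sd(height, na.rm = TRUE)
height_iqr   <- IQR(height, na.rm = TRUE)
```

Without `na.rm = TRUE`, these functions return NA (or raise an error, in the case of IQR) when the variable contains missing values.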
For categorical variables
ggplot(starwars, aes(x = gender)) + geom_bar()
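The bar heights drawn by geom_bar() are counts of rows per category. The same numbers can be obtained as a table with dplyr's count(), which can be handy for checking the plot:

```r
library(dplyr)

gender_counts <- starwars %>%
  count(gender)   # one row per gender value, with the count in column n

gender_counts
```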