class: center, middle, inverse, title-slide

# Tidy data and data wrangling
## Intro to Data Science
### Shawn Santo
### 01-23-20

---

## Announcements

- Homework 1 assigned today, due Jan 30 at 11:59pm

- Tomorrow's lab is still to be done individually. However, you will receive your lab group. You should sit with your group and work on the lab.

---

## Get today's application exercise

- Navigate to https://classroom.github.com/a/eOeCZ3wy
- Clone your application exercise repo, appex03-[github_name]
- Change your project's name in RStudio Cloud to appex03-[your_name]
- Configure git in RStudio Cloud's console pane

```r
library(usethis)
use_git_config(user.name = "name", user.email = "email")
```

---

class: center, middle, inverse

# Recall

---

## What is EDA?

- Exploratory data analysis (EDA) is an approach to analyzing data sets by summarizing the main characteristics.

- Often, EDA is visual. That's what we're focusing on today.

- We can also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis. That's what we're focusing on in the next class.

---

## Package `ggplot2`

`ggplot()` is the main function in `ggplot2`, and plots are constructed in layers.

--

Start with

```r
*ggplot(data = [data], mapping = aes(x = [x-var], y = [y-var]))
```

--

Add a `geom` with

```r
ggplot(data = [data], mapping = aes(x = [x-var], y = [y-var])) +
*  geom_[plot-type]()
```

--

Add other options with

```r
ggplot(data = [data], mapping = aes(x = [x-var], y = [y-var])) +
   geom_[plot-type]() +
*  [options]
```

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether the variable's possible values are uncountable or countable.
<br/><br/>
    - *height* is continuous
<br/><br/>
    - *number of siblings* is discrete

--

- If a variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.
<br/><br/> - *hair color* is unordered <br/><br/> - *year in school* is ordinal --- ## Characterizing numerical distributions - **shape:** - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail) - modality: unimodal, bimodal, multimodal, uniform <br/><br/> - **center:** mean (`mean`), median (`median`), mode (not always useful) <br/><br/> - **spread:** range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`) <br/><br/> - **outliers**: observations outside of the usual pattern <br><br/> *In parentheses are the corresponding R function names.* --- ## Histograms ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_histogram(binwidth = 10, fill = "purple", color = "black") + theme_minimal() ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Density plots ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_density(fill = "lightblue") ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## Side-by-side box plots ```r ggplot(data = starwars, mapping = aes(y = height, x = gender)) + geom_boxplot() + theme_bw(base_size = 20) ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## Bar plots ```r ggplot(data = starwars, mapping = aes(x = gender)) + geom_bar(fill = "#903079", alpha = 0.4) ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ## Segmented bar plots, counts ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) + geom_bar() ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own 
way.
>
>Leo Tolstoy

--

<br/>

**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

<img src="images/tidy.png">

---

## Summary tables

Is each of the following a dataset? Which is a summary table?

.pull-left[
```
#> # A tibble: 87 x 3
#>    name               height  mass
#>    <chr>               <int> <dbl>
#>  1 Luke Skywalker        172    77
#>  2 C-3PO                 167    75
#>  3 R2-D2                  96    32
#>  4 Darth Vader           202   136
#>  5 Leia Organa           150    49
#>  6 Owen Lars             178   120
#>  7 Beru Whitesun lars    165    75
#>  8 R5-D4                  97    32
#>  9 Biggs Darklighter     183    84
#> 10 Obi-Wan Kenobi        182    77
#> # … with 77 more rows
```
]

.pull-right[
```
#> # A tibble: 10 x 2
#>    species   avg_height
#>    <chr>          <dbl>
#>  1 Aleena           79 
#>  2 Besalisk        198 
#>  3 Cerean          198 
#>  4 Chagrian        196 
#>  5 Clawdite        168 
#>  6 Droid           140 
#>  7 Dug             112 
#>  8 Ewok             88 
#>  9 Geonosian       183 
#> 10 Gungan          209.
```
]

---

class: center, middle, inverse

# Pipes

---

## Pipe operator: `%>%`

The pipe operator is implemented in the package `magrittr` and is pronounced "and then". It is available when you load the `tidyverse` package.

- You can think about the following sequence of actions: find keys, start car, drive to campus, park.

- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

--

- Writing it out using pipes gives it a more natural (and easier to read) structure.

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

Similarly, rather than a series of nested functions

```r
h(g(f(x), y = 1), z = 1)
```

We can express the code as

```r
f(x) %>%
  g(y = 1) %>%
  h(z = 1)
```

--

A few notes:

1. By default, the pipe operator takes the object on the left-hand side and passes it into the function call as the value of the function's first argument. `x %>% f(y)` is equivalent to `f(x, y)`.

2. You can insert the pipe operator with keyboard shortcuts:
    - Mac: `command` + `shift` + `M`
    - Windows: `Ctrl` + `Shift` + `M`

---

## What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use a single `.` as a placeholder. For example,

```r
starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
```

--

<br/><br/>

Or similarly, `y %>% f(x, .)` is equivalent to `f(x, y)`.

---

class: center, middle, inverse

# Data wrangling

---

## NC bike crashes

```r
library(tidyverse)
ncbikecrash <- read_csv2("data/nc_bike_crash.csv")
```

.small-text[
```r
ncbikecrash
```

```
#> # A tibble: 11,266 x 55
#>    geo_point_2d Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir
#>    <chr>        <chr>     <chr>   <chr>      <chr>      <chr>      <chr>  
#>  1 36.11201668… Yes       999     Unknown    Missing    Missing    With T…
#>  2 35.22622164… No        41      40-49      No         No         With T…
#>  3 35.53254207… No        25      25-29      No         No         Facing…
#>  4 35.15256263… No        18      16-19      No         No         Facing…
#>  5 35.73139595… Yes       52      50-59      No         No         With T…
#>  6 36.47096690… Yes       81      70+        No         No         With T…
#>  7 35.54803712… Yes       19      16-19      No         No         Facing…
#>  8 35.92714860… Yes       27      25-29      No         No         With T…
#>  9 34.23649348… Yes       22      20-24      No         No         Not Ap…
#> 10 35.03242192… Yes       50      50-59      No         No         With T…
#> # … with 11,256 more rows, and 48 more variables: BikeInjury <chr>,
#> #   BikePos <chr>, BikeRace <chr>, BikeSex <chr>, City <chr>,
#> #   County <chr>, `Biker Intox.` <chr>, `Day of Week` <chr>,
#> #   CrashGrp <chr>, CrashHour <dbl>, CrashLoc <chr>, CrashMonth <chr>,
#> #   CrashSevr <chr>, CrashType <chr>, CrashYear <dbl>, Development <chr>,
#> #   DrvrAge <chr>, DrvrAgeGrp <chr>, DrvrAlcDrg <chr>, DrvrAlcFlg <chr>,
#> #   DrvrInjury <chr>, DrvrRace <chr>, DrvrSex <chr>, DrvrVehTyp <chr>,
#> #   HitRun <chr>, LightCond <chr>, Locality <chr>, NumBicsAin <chr>,
#> #   NumBicsBin <chr>, NumBicsCin <chr>, NumBicsKil <chr>,
#> #   NumBicsNoi <chr>, NumBicsTot <chr>, NumBicsUin <chr>, NumLanes <chr>,
#> #   NumUnits <dbl>, RdCharacte <chr>, RdClass <chr>, RdConditio <chr>,
#> #   RdConfig <chr>, RdDefects <chr>, RdFeature <chr>, RdSurface <chr>,
#> #   RuralUrban <chr>, SpeedLimit <chr>, TraffCntrl <chr>, Weather <chr>,
#> #   Workzone <chr>
```
]

*Source*: [Chapel Hill Open Data](https://www.chapelhillopendata.org/explore/dataset/bicycle-crash-data-chapel-hill-region/table/?disjunctive.city&disjunctive.county&disjunctive.crashday&disjunctive.crashsevr&disjunctive.crashyear)

---

## Variables

```r
names(ncbikecrash)
```

```
#>  [1] "geo_point_2d" "Ambulance"    "BikeAge"      "BikeAgeGrp"  
#>  [5] "BikeAlcDrg"   "BikeAlcFlg"   "BikeDir"      "BikeInjury"  
#>  [9] "BikePos"      "BikeRace"     "BikeSex"      "City"        
#> [13] "County"       "Biker Intox." "Day of Week"  "CrashGrp"    
#> [17] "CrashHour"    "CrashLoc"     "CrashMonth"   "CrashSevr"   
#> [21] "CrashType"    "CrashYear"    "Development"  "DrvrAge"     
#> [25] "DrvrAgeGrp"   "DrvrAlcDrg"   "DrvrAlcFlg"   "DrvrInjury"  
#> [29] "DrvrRace"     "DrvrSex"      "DrvrVehTyp"   "HitRun"      
#> [33] "LightCond"    "Locality"     "NumBicsAin"   "NumBicsBin"  
#> [37] "NumBicsCin"   "NumBicsKil"   "NumBicsNoi"   "NumBicsTot"  
#> [41] "NumBicsUin"   "NumLanes"     "NumUnits"     "RdCharacte"  
#> [45] "RdClass"      "RdConditio"   "RdConfig"     "RdDefects"   
#> [49] "RdFeature"    "RdSurface"    "RuralUrban"   "SpeedLimit"  
#> [53] "TraffCntrl"   "Weather"      "Workzone"
```

---

## A grammar of data manipulation

The `dplyr` package is based on the concept of functions as verbs that manipulate data frames.
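
As a rough illustration of the verb idea, here is a minimal sketch on a made-up tibble. The object `crashes` and its columns are hypothetical and only stand in for a real data frame. Each verb takes a data frame as its first argument and returns a data frame.

```r
library(dplyr)

# Hypothetical toy data, not taken from ncbikecrash
crashes <- tibble(
  county = c("Durham", "Wake", "Durham", "Orange"),
  hour   = c(8, 17, 14, 21)
)

# Read each step as a verb: filter rows, then add a column, then sort
durham <- crashes %>%
  filter(county == "Durham") %>%
  mutate(afternoon = hour >= 12) %>%
  arrange(desc(hour))

durham
```

Because every step returns a data frame, the steps can be chained in any order that makes sense for the question being asked.
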
<center>
<img src="img/02b/dplyr-part-of-tidyverse.png" height="500" width="450">
</center>

---

## Core `dplyr` functions

| Function                         | Description                 |
|---------------------------------:|:----------------------------|
| `filter()`                       | pick rows matching criteria |
| `slice()`                        | pick rows using index(es)   |
| `select()`                       | pick columns by name        |
| `pull()`                         | grab a column as a vector   |
| `arrange()`                      | reorder rows                |
| `mutate()`                       | add new variables           |
| `distinct()`                     | filter for unique rows      |
| `sample_n()` / `sample_frac()`   | randomly sample rows        |
| `summarise()`                    | reduce variables to values  |

- First argument is *always* a data frame
- Subsequent arguments say what to do with that data frame
- The result is almost always a data frame

---

## `dplyr` functions and `%>%`

Since the first argument is *always* a data frame in these core `dplyr` functions and the result is almost always a data frame, using the `%>%` operator follows naturally and makes our code easy to read.

<br/>

Break lines after `%>%`, so that each line ends with the pipe.

```r
ncbikecrash %>%
  select(BikeAge, City) %>%
  filter(City == "Durham")
```

--

<br/>

Starting a line with `%>%` will result in an error.

```r
ncbikecrash
  %>% select(BikeAge, City)
  %>% filter(City == "Durham")
```

```
#> Error: <text>:2:3: unexpected SPECIAL
#> 1: ncbikecrash
#> 2:   %>%
#>      ^
```

---

## `filter()` to select a subset of rows

for crashes in Durham County

```r
ncbikecrash %>%
*  filter(County == "Durham")
```

```
#> # A tibble: 539 x 55
#>    geo_point_2d Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir
#>    <chr>        <chr>     <chr>   <chr>      <chr>      <chr>      <chr>  
#>  1 36.01054214… No        32      30-39      No         No         With T…
#>  2 36.00874470… Yes       29      25-29      Yes-Alcoh… Yes        With T…
#>  3 36.01176620… Yes       16      16-19      No         No         With T…
#>  4 35.99294000… Yes       11      11-15      .          No         Not Ap…
#>  5 36.01285000… Yes       70+     70+        .          No         With T…
#>  6 36.03275299… Yes       13      11-15      .          No         With T…
#>  7 36.017794,-… No        21      20-24      No         No         With T…
#>  8 35.93371999… No        48      40-49      .          No         With T…
#>  9 36.02745000… Yes       12      11-15      .          No         With T…
#> 10 35.99289700… Yes       13      11-15      .          No         Not Ap…
#> # … with 529 more rows, and 48 more variables: BikeInjury <chr>,
#> #   BikePos <chr>, BikeRace <chr>, BikeSex <chr>, City <chr>,
#> #   County <chr>, `Biker Intox.` <chr>, `Day of Week` <chr>,
#> #   CrashGrp <chr>, CrashHour <dbl>, CrashLoc <chr>, CrashMonth <chr>,
#> #   CrashSevr <chr>, CrashType <chr>, CrashYear <dbl>, Development <chr>,
#> #   DrvrAge <chr>, DrvrAgeGrp <chr>, DrvrAlcDrg <chr>, DrvrAlcFlg <chr>,
#> #   DrvrInjury <chr>, DrvrRace <chr>, DrvrSex <chr>, DrvrVehTyp <chr>,
#> #   HitRun <chr>, LightCond <chr>, Locality <chr>, NumBicsAin <chr>,
#> #   NumBicsBin <chr>, NumBicsCin <chr>, NumBicsKil <chr>,
#> #   NumBicsNoi <chr>, NumBicsTot <chr>, NumBicsUin <chr>, NumLanes <chr>,
#> #   NumUnits <dbl>, RdCharacte <chr>, RdClass <chr>, RdConditio <chr>,
#> #   RdConfig <chr>, RdDefects <chr>, RdFeature <chr>, RdSurface <chr>,
#> #   RuralUrban <chr>, SpeedLimit <chr>, TraffCntrl <chr>, Weather <chr>,
#> #   Workzone <chr>
```

???

In R, operators `=` and `<-` are reserved for assignment (creating new variables / objects). When we use `==`, we are checking for equality.

---

## `filter()` for many conditions at once

for crashes in Durham County where the biker was 30-39 years old

```r
ncbikecrash %>%
*  filter(County == "Durham", BikeAgeGrp == "30-39")
```

```
#> # A tibble: 81 x 55
#>    geo_point_2d Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir
#>    <chr>        <chr>     <chr>   <chr>      <chr>      <chr>      <chr>  
#>  1 36.01054214… No        32      30-39      No         No         With T…
#>  2 35.90046288… Yes       35      30-39      No         No         With T…
#>  3 35.99427760… No        30      30-39      No         No         With T…
#>  4 35.96388600… No        30      30-39      .          No         Facing…
#>  5 36.00163,-7… No        32      30-39      .          No         With T…
#>  6 35.87332758… Yes       36      30-39      No         No         With T…
#>  7 35.90216100… Yes       30      30-39      No         No         Not Ap…
#>  8 35.92094823… Yes       33      30-39      No         No         With T…
#>  9 35.96158571… Yes       35      30-39      No         No         With T…
#> 10 35.99913699… No        32      30-39      No         No         With T…
#> # … with 71 more rows, and 48 more variables: BikeInjury <chr>,
#> #   BikePos <chr>, BikeRace <chr>, BikeSex <chr>, City <chr>,
#> #   County <chr>, `Biker Intox.` <chr>, `Day of Week` <chr>,
#> #   CrashGrp <chr>, CrashHour <dbl>, CrashLoc <chr>, CrashMonth <chr>,
#> #   CrashSevr <chr>, CrashType <chr>, CrashYear <dbl>, Development <chr>,
#> #   DrvrAge <chr>, DrvrAgeGrp <chr>, DrvrAlcDrg <chr>, DrvrAlcFlg <chr>,
#> #   DrvrInjury <chr>, DrvrRace <chr>, DrvrSex <chr>, DrvrVehTyp <chr>,
#> #   HitRun <chr>, LightCond <chr>, Locality <chr>, NumBicsAin <chr>,
#> #   NumBicsBin <chr>, NumBicsCin <chr>, NumBicsKil <chr>,
#> #   NumBicsNoi <chr>, NumBicsTot <chr>, NumBicsUin <chr>, NumLanes <chr>,
#> #   NumUnits <dbl>, RdCharacte <chr>, RdClass <chr>, RdConditio <chr>,
#> #   RdConfig <chr>, RdDefects <chr>, RdFeature <chr>, RdSurface <chr>,
#> #   RuralUrban <chr>, SpeedLimit <chr>, TraffCntrl <chr>, Weather <chr>,
#> #   Workzone <chr>
```

---

## Logical operators in R

| Operator                    | Operation                |
|-----------------------------|--------------------------|
| `x < y`                     | less than                |
| `x > y`                     | greater than             |
| `x <= y`                    | less than or equal to    |
| `x >= y`                    | greater than or equal to |
| `x != y`                    | not equal to             |
| `x == y`                    | equal to                 |
| `x %in% y`                  | group membership         |
| `x` &#124; `y`              | or                       |
| `x & y`                     | and                      |
| `!x`                        | not                      |

<br/><br/>

Examples are in the presentation notes. Hit `p` when viewing the slides.

???
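
In practice these operators are usually combined inside `filter()`. A minimal sketch, assuming a made-up tibble `riders` (its columns and values are hypothetical, not from `ncbikecrash`):

```r
library(dplyr)

# Hypothetical toy data for illustration only
riders <- tibble(
  county = c("Durham", "Wake", "Orange", "Durham", "Wake"),
  age    = c(32, 15, 41, 67, 29)
)

# %in% tests group membership; & requires both conditions to hold
adults_durham_orange <- riders %>%
  filter(county %in% c("Durham", "Orange") & age >= 18)

# | keeps rows where at least one condition holds; ! negates a condition
young_or_not_wake <- riders %>%
  filter(age < 18 | !(county == "Wake"))
```

Separating conditions with a comma inside `filter()` is equivalent to joining them with `&`.
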
## Examples of logical operators ```r 5 > 6 ``` ``` #> [1] FALSE ``` ```r 10 < -11 ``` ``` #> [1] FALSE ``` ```r 4 >= 4 ``` ``` #> [1] TRUE ``` ```r 1 == 1 ``` ``` #> [1] TRUE ``` ```r 3 == 3.4 ``` ``` #> [1] FALSE ``` ```r c(3, 4) == c(3, 10) ``` ``` #> [1] TRUE FALSE ``` ```r c(-4, 4, 0) %in% c(2, 5, 1, 4, 2, 0, 0, -100) ``` ``` #> [1] FALSE TRUE TRUE ``` ```r c(2, 5, 1, 4, 2, 0, 0, -100) %in% c(-4, 4, 0) ``` ``` #> [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE ``` ```r TRUE & TRUE ``` ``` #> [1] TRUE ``` ```r TRUE & FALSE ``` ``` #> [1] FALSE ``` ```r TRUE | FALSE ``` ``` #> [1] TRUE ``` ```r FALSE | FALSE ``` ``` #> [1] FALSE ``` ```r !(c(3, 4) == c(3, 10)) ``` ``` #> [1] FALSE TRUE ``` --- ## `select()` to keep variables ```r ncbikecrash %>% filter(County == "Durham", BikeAgeGrp == "30-39") %>% * select(Locality, Weather) ``` ``` #> # A tibble: 81 x 2 #> Locality Weather #> <chr> <chr> #> 1 Urban (>70% Developed) Clear #> 2 Mixed (30% To 70% Developed) Clear #> 3 Urban (>70% Developed) Clear #> 4 Urban (>70% Developed) Clear #> 5 Urban (>70% Developed) Clear #> 6 Rural (<30% Developed) Clear #> 7 Urban (>70% Developed) Clear #> 8 Rural (<30% Developed) Clear #> 9 Mixed (30% To 70% Developed) Cloudy #> 10 Urban (>70% Developed) Clear #> # … with 71 more rows ``` --- ## `select()` to exclude variables ```r ncbikecrash %>% * select(-geo_point_2d) ``` ``` #> # A tibble: 11,266 x 54 #> Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir BikeInjury #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 Yes 999 Unknown Missing Missing With T… Unknown I… #> 2 No 41 40-49 No No With T… C: Possib… #> 3 No 25 25-29 No No Facing… C: Possib… #> 4 No 18 16-19 No No Facing… O: No Inj… #> 5 Yes 52 50-59 No No With T… C: Possib… #> 6 Yes 81 70+ No No With T… K: Killed #> 7 Yes 19 16-19 No No Facing… C: Possib… #> 8 Yes 27 25-29 No No With T… C: Possib… #> 9 Yes 22 20-24 No No Not Ap… C: Possib… #> 10 Yes 50 50-59 No No With T… C: Possib… #> # … with 11,256 more rows, and 
47 more variables: BikePos <chr>, #> # BikeRace <chr>, BikeSex <chr>, City <chr>, County <chr>, `Biker #> # Intox.` <chr>, `Day of Week` <chr>, CrashGrp <chr>, CrashHour <dbl>, #> # CrashLoc <chr>, CrashMonth <chr>, CrashSevr <chr>, CrashType <chr>, #> # CrashYear <dbl>, Development <chr>, DrvrAge <chr>, DrvrAgeGrp <chr>, #> # DrvrAlcDrg <chr>, DrvrAlcFlg <chr>, DrvrInjury <chr>, DrvrRace <chr>, #> # DrvrSex <chr>, DrvrVehTyp <chr>, HitRun <chr>, LightCond <chr>, #> # Locality <chr>, NumBicsAin <chr>, NumBicsBin <chr>, NumBicsCin <chr>, #> # NumBicsKil <chr>, NumBicsNoi <chr>, NumBicsTot <chr>, #> # NumBicsUin <chr>, NumLanes <chr>, NumUnits <dbl>, RdCharacte <chr>, #> # RdClass <chr>, RdConditio <chr>, RdConfig <chr>, RdDefects <chr>, #> # RdFeature <chr>, RdSurface <chr>, RuralUrban <chr>, SpeedLimit <chr>, #> # TraffCntrl <chr>, Weather <chr>, Workzone <chr> ``` --- ## `select()` a range of variables ```r ncbikecrash %>% * select(Ambulance:BikeInjury) ``` ``` #> # A tibble: 11,266 x 7 #> Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir BikeInjury #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 Yes 999 Unknown Missing Missing With Tra… Unknown In… #> 2 No 41 40-49 No No With Tra… C: Possibl… #> 3 No 25 25-29 No No Facing T… C: Possibl… #> 4 No 18 16-19 No No Facing T… O: No Inju… #> 5 Yes 52 50-59 No No With Tra… C: Possibl… #> 6 Yes 81 70+ No No With Tra… K: Killed #> 7 Yes 19 16-19 No No Facing T… C: Possibl… #> 8 Yes 27 25-29 No No With Tra… C: Possibl… #> 9 Yes 22 20-24 No No Not Appl… C: Possibl… #> 10 Yes 50 50-59 No No With Tra… C: Possibl… #> # … with 11,256 more rows ``` --- ## `slice()` for certain rows by index(es) ```r ncbikecrash %>% select(BikeAlcFlg, BikeAgeGrp, City) %>% * slice(1:3) ``` ``` #> # A tibble: 3 x 3 #> BikeAlcFlg BikeAgeGrp City #> <chr> <chr> <chr> #> 1 Missing Unknown Greensboro #> 2 No 40-49 Charlotte #> 3 No 25-29 Selma ``` -- to get the last 3 rows ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% select(BikeAlcFlg, 
BikeAgeGrp, City) %>% * slice((last_row - 2):last_row) ``` ``` #> # A tibble: 3 x 3 #> BikeAlcFlg BikeAgeGrp City #> <chr> <chr> <chr> #> 1 No 11-15 Salisbury #> 2 No 60-69 Wilmington #> 3 No 50-59 None - Rural Crash ``` --- ## `pull()` to extract a column as a vector ```r ncbikecrash %>% slice(1:6) %>% * pull(Locality) ``` ``` #> [1] "Urban (>70% Developed)" "Urban (>70% Developed)" #> [3] "Urban (>70% Developed)" "Urban (>70% Developed)" #> [5] "Rural (<30% Developed)" "Rural (<30% Developed)" ``` -- vs. ```r ncbikecrash %>% slice(1:6) %>% * select(Locality) ``` ``` #> # A tibble: 6 x 1 #> Locality #> <chr> #> 1 Urban (>70% Developed) #> 2 Urban (>70% Developed) #> 3 Urban (>70% Developed) #> 4 Urban (>70% Developed) #> 5 Rural (<30% Developed) #> 6 Rural (<30% Developed) ``` --- ## `sample_n()`, `sample_frac()` for a random sample To randomly sample 5 observations ```r ncbikecrash_n5 <- ncbikecrash %>% * sample_n(5, replace = FALSE) dim(ncbikecrash_n5) # check the dimensions of data frame ``` ``` #> [1] 5 55 ``` -- To randomly sample 20% of all observations ```r ncbikecrash_perc20 <-ncbikecrash %>% * sample_frac(0.2, replace = FALSE) dim(ncbikecrash_perc20) # check the dimensions of data frame ``` ``` #> [1] 2253 55 ``` --- ## `distinct()` to filter for unique rows And `arrange()` to order alphabetically ```r ncbikecrash %>% select(County, City) %>% * distinct() %>% * arrange(County, City) ``` ``` #> # A tibble: 430 x 2 #> County City #> <chr> <chr> #> 1 Alamance Alamance #> 2 Alamance Burlington #> 3 Alamance Elon #> 4 Alamance Elon College #> 5 Alamance Gibsonville #> 6 Alamance Graham #> 7 Alamance Green Level #> 8 Alamance Haw River #> 9 Alamance Mebane #> 10 Alamance None - Rural Crash #> # … with 420 more rows ``` --- ## `summarise()` to reduce variables to values ```r ncbikecrash %>% * summarise(avg_hr = mean(CrashHour)) ``` ``` #> # A tibble: 1 x 1 #> avg_hr #> <dbl> #> 1 14.6 ``` --- ## `group_by()` to perform calculations on group levels ```r 
ncbikecrash %>% * group_by(HitRun) %>% summarise(avg_hr = mean(CrashHour)) ``` ``` #> # A tibble: 2 x 2 #> HitRun avg_hr #> <chr> <dbl> #> 1 No 14.5 #> 2 Yes 14.9 ``` --- ## `count()` observations in groups ```r ncbikecrash %>% * count(DrvrAlcDrg) ``` ``` #> # A tibble: 10 x 2 #> DrvrAlcDrg n #> <chr> <int> #> 1 . 2830 #> 2 Missing 945 #> 3 No 7001 #> 4 Unknown 353 #> 5 Yes-Alcohol and Drugs, impairment detected 1 #> 6 Yes-Alcohol and Drugs, impairment suspected 10 #> 7 Yes-Alcohol, impairment detected 31 #> 8 Yes-Alcohol, impairment suspected 76 #> 9 Yes-Drugs, impairment detected 5 #> 10 Yes-Drugs, impairment suspected 14 ``` --- ## `mutate()` to add new variables ```r ncbikecrash %>% filter(DrvrAge != "70+", DrvrAge != "999") %>% * mutate(DrvrAge = as.numeric(DrvrAge), * DrvrRisk = ifelse(DrvrAge < 26, "High", "Low")) ``` -- What does function `ifelse()` do? Consider a small example. ```r ages <- c(14, 10, 33, 22, 8, 54, 14) ages ``` ``` #> [1] 14 10 33 22 8 54 14 ``` -- ```r ifelse(ages < 26, "High", "Low") ``` ``` #> [1] "High" "High" "Low" "High" "High" "Low" "High" ``` -- <br/> ```r ncbikecrash %>% select(DrvrAge, DrvrRisk) ``` ``` #> Error in .f(.x[[i]], ...): object 'DrvrRisk' not found ``` --- ## "Save" when you `mutate()` Often when you define a new variable with `mutate()` you'll also want to save the resulting data frame. ```r ncbikecrash_fix_age <- ncbikecrash %>% filter(DrvrAge != "70+", DrvrAge != "999") %>% * mutate(DrvrAge = as.numeric(DrvrAge), * DrvrRisk = ifelse(DrvrAge < 26, "High", "Low")) ``` -- ```r ncbikecrash_fix_age %>% select(DrvrAge, DrvrRisk) %>% sample_n(10) ``` ``` #> # A tibble: 10 x 2 #> DrvrAge DrvrRisk #> <dbl> <chr> #> 1 59 Low #> 2 49 Low #> 3 35 Low #> 4 55 Low #> 5 35 Low #> 6 67 Low #> 7 61 Low #> 8 65 Low #> 9 49 Low #> 10 44 Low ``` --- ## Did that work? 
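
One quick way to check a recoding like this is to look at the range of the numeric variable within each level of the new one. A minimal sketch on made-up ages (the `drivers` tibble and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical ages, recoded with the same rule as DrvrRisk
drivers <- tibble(age = c(16, 19, 25, 26, 40, 73)) %>%
  mutate(risk = ifelse(age < 26, "High", "Low"))

# Smallest and largest age within each risk level
risk_check <- drivers %>%
  group_by(risk) %>%
  summarise(min_age = min(age), max_age = max(age), n = n())
```

If the recoding worked, every "High" age is below 26 and every "Low" age is 26 or above. On the actual data, counting the combinations of `DrvrAge` and `DrvrRisk` serves the same purpose:
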
```r ncbikecrash_fix_age %>% group_by(DrvrAge, DrvrRisk) %>% count() ``` ``` #> # A tibble: 86 x 3 #> # Groups: DrvrAge, DrvrRisk [86] #> DrvrAge DrvrRisk n #> <dbl> <chr> <int> #> 1 9 High 1 #> 2 12 High 1 #> 3 13 High 3 #> 4 14 High 1 #> 5 15 High 9 #> 6 16 High 88 #> 7 17 High 139 #> 8 18 High 183 #> 9 19 High 221 #> 10 20 High 251 #> # … with 76 more rows ``` --- ## Application exercise - Navigate to https://classroom.github.com/a/eOeCZ3wy - Clone your application exercise repo, appex03-[github_name] - Change your project's name in RStudio Cloud to appex03-[your_name] - Configure git in RStudio Cloud's console pane ```r library(usethis) use_git_config(user.name = "name", user.email = "email") ``` ??? ## Task 1 Identify the most common driver speed limit for when bike crashes occur. ```r ncbikecrash %>% count(SpeedLimit) %>% arrange(desc(n)) %>% slice(1) ``` ``` #> # A tibble: 1 x 2 #> SpeedLimit n #> <chr> <int> #> 1 30 - 35 MPH 4488 ``` ## Task 2 Filter `ncbikecrash` for crashes in residential areas where the driver age group is 0-19. ```r ncbikecrash %>% filter(Development == "Residential", DrvrAgeGrp == "0-19") ``` ``` #> # A tibble: 306 x 55 #> geo_point_2d Ambulance BikeAge BikeAgeGrp BikeAlcDrg BikeAlcFlg BikeDir #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 34.79133149… No 8 6-10 No No Not Ap… #> 2 35.59038000… No 14 11-15 . No Facing… #> 3 35.66393999… Yes 6 6-10 . No Not Ap… #> 4 35.24733400… Yes 10 6-10 . No Unknown #> 5 35.76368269… No 7 6-10 No No Not Ap… #> 6 35.48880441… Yes 45 40-49 No No With T… #> 7 35.10998499… Yes 10 6-10 . No Facing… #> 8 34.723793,-… Yes 42 40-49 . No Facing… #> 9 36.03639980… Yes 16 16-19 No No With T… #> 10 35.71809000… No 44 40-49 . 
No With T… #> # … with 296 more rows, and 48 more variables: BikeInjury <chr>, #> # BikePos <chr>, BikeRace <chr>, BikeSex <chr>, City <chr>, #> # County <chr>, `Biker Intox.` <chr>, `Day of Week` <chr>, #> # CrashGrp <chr>, CrashHour <dbl>, CrashLoc <chr>, CrashMonth <chr>, #> # CrashSevr <chr>, CrashType <chr>, CrashYear <dbl>, Development <chr>, #> # DrvrAge <chr>, DrvrAgeGrp <chr>, DrvrAlcDrg <chr>, DrvrAlcFlg <chr>, #> # DrvrInjury <chr>, DrvrRace <chr>, DrvrSex <chr>, DrvrVehTyp <chr>, #> # HitRun <chr>, LightCond <chr>, Locality <chr>, NumBicsAin <chr>, #> # NumBicsBin <chr>, NumBicsCin <chr>, NumBicsKil <chr>, #> # NumBicsNoi <chr>, NumBicsTot <chr>, NumBicsUin <chr>, NumLanes <chr>, #> # NumUnits <dbl>, RdCharacte <chr>, RdClass <chr>, RdConditio <chr>, #> # RdConfig <chr>, RdDefects <chr>, RdFeature <chr>, RdSurface <chr>, #> # RuralUrban <chr>, SpeedLimit <chr>, TraffCntrl <chr>, Weather <chr>, #> # Workzone <chr> ``` ## Task 3 What is the mean hour for when a crash occurs in Durham County for each month? ```r ncbikecrash %>% filter(County == "Durham") %>% group_by(CrashMonth) %>% summarise(mean_crash_hour = mean(CrashHour)) ``` ``` #> # A tibble: 12 x 2 #> CrashMonth mean_crash_hour #> <chr> <dbl> #> 1 April 14.6 #> 2 August 13.1 #> 3 December 14.6 #> 4 February 15.9 #> 5 January 14 #> 6 July 15.1 #> 7 June 15.5 #> 8 March 15 #> 9 May 14.1 #> 10 November 14.6 #> 11 October 14.5 #> 12 September 14.5 ``` ## Task 4 Plot your result from Task 3. ```r ncbikecrash %>% filter(County == "Durham") %>% group_by(CrashMonth) %>% summarise(mean_crash_hour = mean(CrashHour)) %>% ggplot(mapping = aes(x = CrashMonth, y = mean_crash_hour)) + geom_point() ``` <img src="lec-03a-tidy-data-wrangle_files/figure-html/task-4-1.png" style="display: block; margin: auto;" /> --- ## References 1. Bicycle Crashes. (2020). Chapelhillopendata.org. 
Retrieved 21 January 2020, from https://www.chapelhillopendata.org/explore/dataset/bicycle-crash-data-chapel-hill-region/information/?disjunctive.city&disjunctive.county&disjunctive.crashday&disjunctive.crashsevr&disjunctive.crashyear