class: center, middle, inverse, title-slide # Tidy data and data wrangling ### Dr. Maria Tackett ### 01.23.19 --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Eli Review Signup: 1. Go to [https://app.elireview.com](https://app.elireview.com). 2. Click the “New to Eli Review? Sign up!” button to create your account. - Be sure to choose the “Student” option. - Consider using your school email, but any address will work. Join our course: In the box labeled “Ready to join a course?” enter this course code: **hyper425bend** --- ## Check in - Any questions on material from last time? - Any questions on the lab? - Any questions on workflow / course structure? --- class: center, middle # Identifying variables --- ## Number of variables involved - .vocab[Univariate data analysis]: distribution of a single variable - .vocab[Bivariate data analysis]: relationship between two variables - .vocab[Multivariate data analysis]: relationship between many variables at once, usually focusing on the relationship between two while conditioning on others --- ## Types of variables - .vocab[Numerical variables] can be classified as .vocab[continuous] or .vocab[discrete] based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively. - *height* is continuous - *number of siblings* is discrete -- - If the variable is .vocab[categorical], we can determine if it is .vocab[ordinal] based on whether or not the levels have a natural ordering. 
- *hair color* is unordered - *year in school* is ordinal --- class: center, middle # Visualizing numerical data --- ### Describing shapes of numerical distributions - **shape:** - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail) - modality: unimodal, bimodal, multimodal, uniform - **center:** mean (**`mean`**), median (**`median`**), mode (not always useful) - **spread:** range (**`range`**), standard deviation (**`sd`**), inter-quartile range (**`IQR`**) - **outliers**: observations outside of the usual pattern --- ## Histograms .small[ ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_histogram(binwidth = 10) ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_bin). ``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> ] --- ## Density plots .small[ ```r ggplot(data = starwars, mapping = aes(x = height)) + geom_density() ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_density). ``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] --- ## Side-by-side box plots .small[ ```r ggplot(data = starwars, mapping = aes(y = height, x = gender)) + geom_boxplot() ``` ``` ## Warning: Removed 6 rows containing non-finite values (stat_boxplot). 
``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- class: center, middle ## Visualizing categorical data --- ## Bar plots .small[ ```r ggplot(data = starwars, mapping = aes(x = gender)) + geom_bar() ``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- ### Segmented bar plots, counts .small[ ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) + geom_bar() ``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- ### Segmented bar plots, proportions .small[ ```r ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) + * geom_bar(position = "fill") + labs(y = "proportion") ``` <img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- .question[ Which bar plot is a more useful representation for visualizing the relationship between gender and hair color? Why? ] .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- ## Summary tables .question[ Is each of the following a dataset or a summary table? 
] .small[ .pull-left[ ``` ## # A tibble: 87 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## 7 Beru Whitesun lars 165 75 ## 8 R5-D4 97 32 ## 9 Biggs Darklighter 183 84 ## 10 Obi-Wan Kenobi 182 77 ## # … with 77 more rows ``` ] .pull-right[ ``` ## # A tibble: 5 x 2 ## gender avg_height ## <chr> <dbl> ## 1 female 165. ## 2 hermaphrodite 175 ## 3 male 179. ## 4 none 200 ## 5 <NA> 120 ``` ] ] --- class: center, middle # Pipes --- ## Where does the name come from? The pipe operator, `%>%`, is implemented in the **magrittr** package and is pronounced "and then". .pull-left[  ] .pull-right[  ] .small[ [https://en.wikipedia.org/wiki/The_Treachery_of_Images](https://en.wikipedia.org/wiki/The_Treachery_of_Images) ] --- ## Review: How does a pipe work? - Think about the following sequence of actions: find keys, unlock car, start car, drive to campus, park. - Expressed as a set of nested functions in R pseudocode, this would look like: ```r park(drive(start_car(unlock_car(find("keys"))), to = "campus")) ``` -- - Writing it out using pipes gives it a more natural (and easier to read) structure: ```r find("keys") %>% unlock_car() %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## What about other arguments? To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use `.` as a placeholder: ```r starwars %>% filter(species == "Human") %>% lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -116.58 1.11 ``` --- class: center, middle # Data wrangling --- ## Bike crashes in NC 2007 - 2014 The dataset is in the **dsbox** package: ```r library(dsbox) ncbikecrash ``` - The dataset contains all North Carolina bike crash data from 2007-2014. - Data downloaded on Sep 6, 2018. 
--- ## Variables View the names of variables via .tiny[ ```r names(ncbikecrash) ``` ``` ## [1] "object_id" "city" "county" ## [4] "region" "development" "locality" ## [7] "on_road" "rural_urban" "speed_limit" ## [10] "traffic_control" "weather" "workzone" ## [13] "bike_age" "bike_age_group" "bike_alcohol" ## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury" ## [19] "bike_position" "bike_race" "bike_sex" ## [22] "driver_age" "driver_age_group" "driver_alcohol" ## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury" ## [28] "driver_race" "driver_sex" "driver_vehicle_type" ## [31] "crash_alcohol" "crash_date" "crash_day" ## [34] "crash_group" "crash_hour" "crash_location" ## [37] "crash_month" "crash_severity" "crash_time" ## [40] "crash_type" "crash_year" "ambulance_req" ## [43] "hit_run" "light_condition" "road_character" ## [46] "road_class" "road_condition" "road_configuration" ## [49] "road_defects" "road_feature" "road_surface" ## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci" ## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to" ## [58] "num_bikes_ui" "num_lanes" "num_units" ## [61] "distance_mi_from" "frm_road" "rte_invd_cd" ## [64] "towrd_road" "geo_point" "geo_shape" ``` ] and see detailed descriptions with `?ncbikecrash`. 
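---

## Finding variables by name

With 66 columns, scanning the printed names by eye gets tedious. Here is a minimal sketch of searching them with base R's `grep`. The vector `some_names` is a hand-copied subset of the names above, used as a stand-in so the code runs without **dsbox**; with the package loaded, `names(ncbikecrash)` would take its place.

```r
# Hand-copied subset of names(ncbikecrash); a stand-in, not the full set of 66
some_names <- c("object_id", "city", "county", "bike_age", "bike_age_group",
                "bike_sex", "driver_age", "driver_sex", "crash_hour")

# Keep only the names that start with "bike_"
bike_vars <- grep("^bike_", some_names, value = TRUE)
bike_vars
```

The `value = TRUE` argument makes `grep` return the matching names themselves rather than their positions.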
--- ## Viewing your data - In the Environment pane, after loading with `data(ncbikecrash)`, click on the name of the data frame to view it in the data viewer - Use the `glimpse` function to take a peek ```r glimpse(ncbikecrash) ``` ``` ## Observations: 7,467 ## Variables: 66 ## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1… ## $ city <chr> "None - Rural Crash", "Henderson", "None - … ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N… ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal… ## $ development <chr> "Farms, Woods, Pastures", "Residential", "F… ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70… ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK… ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"… ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M… ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double … ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",… ## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "… ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "… ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"… ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff… ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", … ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"… ## $ bike_race <chr> "Black", "Black", "White", "Black", "White"… ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "… ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",… ## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-… ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M… ## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ driver_est_speed <chr> 
"51-55 mph", "6-10 mph", "41-45 mph", "11-1… ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In… ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl… ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema… ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "… ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", … ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D… ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd… ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist… ## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22… ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In… ## $ crash_month <chr> "December", "November", "November", "Decemb… ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", … ## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13… ## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist… ## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y… ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",… ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R… ## $ road_character <chr> "Straight - Level", "Straight - Level", "St… ## $ road_class <chr> "State Secondary Route", "Local Street", "U… ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi… ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, … ## $ road_defects <chr> "None", NA, "None", "None", "None", "None",… ## $ road_feature <chr> "No Special Feature", "T-Intersection", "No… ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth… ## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_ki 
<int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", … ## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"… ## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.315187… ## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-7… ``` --- ### A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[  ] .pull-right[ .tiny[ - `filter`: pick rows matching criteria - `slice`: pick rows using index(es) - `select`: pick columns by name - `pull`: grab a column as a vector - `arrange`: reorder rows ] ] --- ### A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[  ] .pull-right[ .tiny[ - `mutate`: add new variables - `distinct`: filter for unique rows - `sample_n` / `sample_frac`: randomly sample rows - `summarise`: reduce variables to values - ... (many more) ] ] --- ## **dplyr** rules for functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ## A note on piping and layering - The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code. <br> - The `+` operator in **ggplot2** functions is used for "layering". 
This means you create the plot in layers, separated by `+`. --- ### `filter` to select a subset of rows for crashes in Durham County ```r ncbikecrash %>% * filter(county == "Durham") ``` ``` ## # A tibble: 340 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban ## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban ## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban ## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban ## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## # … with 330 more rows, and 58 more variables: speed_limit <chr>, ## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>, ## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>, ## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>, ## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, ## # driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, ## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>, ## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>, ## # crash_day <chr>, crash_group <chr>, crash_hour <int>, ## # crash_location <chr>, crash_month <chr>, crash_severity <chr>, ## # crash_time <time>, crash_type <chr>, crash_year <int>, ## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>, ## # road_character <chr>, road_class <chr>, road_condition <chr>, ## # road_configuration <chr>, road_defects <chr>, road_feature <chr>, ## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>, 
## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>, ## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>, ## # num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` --- ### `filter` for many conditions at once for crashes in Durham County where biker was 0-5 years old ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") ``` ``` ## # A tibble: 4 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban ## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, 
num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ### Logical operators in R operator | definition || operator | definition ------------|------------------------------||--------------|---------------- `<` | less than ||`x` | `y` | `x` OR `y` `<=` | less than or equal to ||`is.na(x)` | test if `x` is `NA` `>` | greater than ||`!is.na(x)` | test if `x` is not `NA` `>=` | greater than or equal to ||`x %in% y` | test if `x` is in `y` `==` | exactly equal to ||`!(x %in% y)` | test if `x` is not in `y` `!=` | not equal to ||`!x` | not `x` `x & y` | `x` AND `y` || | --- ### `select` to keep variables ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") %>% select(locality, speed_limit) ``` ``` ## # A tibble: 4 x 2 ## locality speed_limit ## <chr> <chr> ## 1 Urban (>70% Developed) 30 - 35 MPH ## 2 Urban (>70% Developed) 5 - 15 MPH ## 3 Urban (>70% Developed) 20 - 25 MPH ## 4 Urban (>70% Developed) 20 - 25 MPH ``` --- ### `select` to exclude variables ```r ncbikecrash %>% select(-object_id) ``` ``` ## # A tibble: 7,467 x 65 ## city county region development locality on_road rural_urban speed_limit ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M… ## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M… ## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M… ## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M… ## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA> ## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M… ## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M… ## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M… ## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # … with 7,457 more rows, 
and 57 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ### `select` a range of variables ```r ncbikecrash %>% select(city:locality) ``` ``` ## # A tibble: 7,467 x 5 ## city county region development locality ## <chr> <chr> <chr> <chr> <chr> ## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D… ## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop… ## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop… ## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop… ## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 7 None - Rural … Richmond Piedmo… Residential Mixed (30% To 70% D… ## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop… ## 9 
Whiteville Columbus Coastal Residential Rural (<30% Develop… ## 10 New Bern Craven Coastal Residential Urban (>70% Develop… ## # … with 7,457 more rows ``` --- ### `slice` for certain row numbers First five ```r ncbikecrash %>% slice(1:5) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural ## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban ## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural ## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban ## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` 
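---

### `slice` with other indexes

The row indexes passed to `slice` do not have to be a contiguous range. The sketch below illustrates this on `starwars`, which ships with **dplyr**, so it runs without **dsbox**; the object names `first_five`, `odd_rows`, and `without_first` are made up for this example.

```r
library(dplyr)

# Same pattern as the slide above: keep rows 1 through 5
first_five <- starwars %>%
  slice(1:5)

# Indexes need not be contiguous: keep rows 1, 3, and 5 only
odd_rows <- starwars %>%
  slice(c(1, 3, 5))

# Negative indexes drop rows instead of keeping them
without_first <- starwars %>%
  slice(-1)

first_five$name
odd_rows$name
nrow(starwars) - nrow(without_first)
```

The same calls should work on `ncbikecrash`, since `slice` only depends on row position and not on the columns involved.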
--- ### `slice` for certain row numbers Last five ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban ## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban ## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban ## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban ## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ### `pull` to extract a column as a vector ```r ncbikecrash %>% slice(1:6) %>% pull(locality) ``` ``` ## [1] "Rural 
(<30% Developed)" "Mixed (30% To 70% Developed)" ## [3] "Rural (<30% Developed)" "Urban (>70% Developed)" ## [5] "Urban (>70% Developed)" "Rural (<30% Developed)" ``` vs. ```r ncbikecrash %>% slice(1:6) %>% select(locality) ``` ``` ## # A tibble: 6 x 1 ## locality ## <chr> ## 1 Rural (<30% Developed) ## 2 Mixed (30% To 70% Developed) ## 3 Rural (<30% Developed) ## 4 Urban (>70% Developed) ## 5 Urban (>70% Developed) ## 6 Rural (<30% Developed) ``` --- ### `sample_n` / `sample_frac` for a random sample - `sample_n`: randomly sample 5 observations ```r ncbikecrash_n5 <- ncbikecrash %>% sample_n(5, replace = FALSE) dim(ncbikecrash_n5) ``` ``` ## [1] 5 66 ``` - `sample_frac`: randomly sample 20% of observations ```r ncbikecrash_perc20 <-ncbikecrash %>% sample_frac(0.2, replace = FALSE) dim(ncbikecrash_perc20) ``` ``` ## [1] 1493 66 ``` --- ### `distinct` to filter for unique rows And `arrange` to order alphabetically ```r ncbikecrash %>% select(county, city) %>% distinct() %>% arrange(county, city) ``` ``` ## # A tibble: 391 x 2 ## county city ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon ## 4 Alamance Elon College ## 5 Alamance Gibsonville ## 6 Alamance Graham ## 7 Alamance Green Level ## 8 Alamance Mebane ## 9 Alamance None - Rural Crash ## 10 Alexander None - Rural Crash ## # … with 381 more rows ``` --- ### `summarise` to reduce variables to values ```r ncbikecrash %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 1 x 1 ## avg_hr ## <dbl> ## 1 14.7 ``` --- ### `group_by` to do calculations on groups ```r ncbikecrash %>% group_by(hit_run) %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 2 x 2 ## hit_run avg_hr ## <chr> <dbl> ## 1 No 14.6 ## 2 Yes 15.0 ``` --- ### `count` observations in groups ```r ncbikecrash %>% count(driver_alcohol_drugs) ``` ``` ## # A tibble: 6 x 2 ## driver_alcohol_drugs n ## <chr> <int> ## 1 Missing 99 ## 2 No 695 ## 3 Yes-Alcohol, impairment suspected 12 ## 4 Yes-Alcohol, no 
impairment detected 3 ## 5 Yes-Drugs, impairment suspected 4 ## 6 <NA> 6654 ``` --- ### `mutate` to add new variables .small[ ```r ncbikecrash %>% mutate(driver_alcohol_drugs_simplified = case_when( driver_alcohol_drugs == "Missing" ~ NA_character_, str_detect(driver_alcohol_drugs, "Yes") ~ "Yes", TRUE ~ "No" )) ``` ] --- ### "Save" when you `mutate` Most often when you define a new variable with `mutate`, you'll also want to save the resulting data frame, often by writing over the original data frame. ```r ncbikecrash <- ncbikecrash %>% mutate(driver_alcohol_drugs_simplified = case_when( str_detect(driver_alcohol_drugs, "Yes") ~ "Yes", TRUE ~ driver_alcohol_drugs )) ``` --- ### Check before you move on ```r ncbikecrash %>% count(driver_alcohol_drugs, driver_alcohol_drugs_simplified) ``` ``` ## # A tibble: 6 x 3 ## driver_alcohol_drugs driver_alcohol_drugs_simplified n ## <chr> <chr> <int> ## 1 Missing Missing 99 ## 2 No No 695 ## 3 Yes-Alcohol, impairment suspected Yes 12 ## 4 Yes-Alcohol, no impairment detected Yes 3 ## 5 Yes-Drugs, impairment suspected Yes 4 ## 6 <NA> <NA> 6654 ``` ```r ncbikecrash %>% count(driver_alcohol_drugs_simplified) ``` ``` ## # A tibble: 4 x 2 ## driver_alcohol_drugs_simplified n ## <chr> <int> ## 1 Missing 99 ## 2 No 695 ## 3 Yes 19 ## 4 <NA> 6654 ``` --- ## AE 04 - NC bike crashes - Copy the NC Bike Crashes project on RStudio Cloud - For each question you work on, set the `eval` chunk option to `TRUE` and knit --- ## Before next class - You will get your teams in lab tomorrow!