class: center, middle, inverse, title-slide # Tidy data and data wrangling
🔧 ### Dr. Çetinkaya-Rundel --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - Consider adding your picture to Slack and GitHub - Check your email for an invitation to the course GitHub organization, and join - Make use of office hours! --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- ## Summary tables .question[ Is each of the following a dataset or a summary table? ] .small[ .pull-left[ ``` ## # A tibble: 87 x 3 ## name height mass ## <chr> <int> <dbl> ## 1 Luke Skywalker 172 77 ## 2 C-3PO 167 75 ## 3 R2-D2 96 32 ## 4 Darth Vader 202 136 ## 5 Leia Organa 150 49 ## 6 Owen Lars 178 120 ## 7 Beru Whitesun lars 165 75 ## 8 R5-D4 97 32 ## 9 Biggs Darklighter 183 84 ## 10 Obi-Wan Kenobi 182 77 ## # ... with 77 more rows ``` ] .pull-right[ ``` ## # A tibble: 5 x 2 ## gender avg_height ## <chr> <dbl> ## 1 female 165. ## 2 hermaphrodite 175 ## 3 male 179. ## 4 none 200 ## 5 <NA> 120 ``` ] ] --- class: center, middle # Pipes --- ## Where does the name come from? The pipe operator is implemented in the package **magrittr**, and it's pronounced "and then". .pull-left[  ] .pull-right[  ] --- ## Review: How does a pipe work? - You can think about the following sequence of actions: find key, unlock car, start car, drive to school, park. 
- Expressed as a set of nested functions in R pseudocode this would look like: ```r park(drive(start_car(find("keys")), to = "campus")) ``` - Writing it out using pipes gives it a more natural (and easier to read) structure: ```r find("keys") %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## What about other arguments? To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use `.`: ```r starwars %>% filter(species == "Human") %>% lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -116.58 1.11 ``` --- class: center, middle # Data wrangling --- ## Bike crashes in NC 2007 - 2014 The dataset is in the **dsbox** package: ```r library(dsbox) ncbikecrash ``` --- ## Variables View the names of variables via ```r names(ncbikecrash) ``` ``` ## [1] "object_id" "city" "county" ## [4] "region" "development" "locality" ## [7] "on_road" "rural_urban" "speed_limit" ## [10] "traffic_control" "weather" "workzone" ## [13] "bike_age" "bike_age_group" "bike_alcohol" ## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury" ## [19] "bike_position" "bike_race" "bike_sex" ## [22] "driver_age" "driver_age_group" "driver_alcohol" ## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury" ## [28] "driver_race" "driver_sex" "driver_vehicle_type" ## [31] "crash_alcohol" "crash_date" "crash_day" ## [34] "crash_group" "crash_hour" "crash_location" ## [37] "crash_month" "crash_severity" "crash_time" ## [40] "crash_type" "crash_year" "ambulance_req" ## [43] "hit_run" "light_condition" "road_character" ## [46] "road_class" "road_condition" "road_configuration" ## [49] "road_defects" "road_feature" "road_surface" ## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci" ## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to" ## [58] "num_bikes_ui" "num_lanes" "num_units" ## [61] "distance_mi_from" "frm_road" "rte_invd_cd" ## [64] "towrd_road" 
"geo_point" "geo_shape" ``` and see detailed descriptions with `?ncbikecrash`. --- ## Viewing your data - In the Environment, after loading with `data(ncbikecrash)`, and click on the name of the data frame to view it in the data viewer - Use the `glimpse` function to take a peek ```r glimpse(ncbikecrash) ``` ``` ## Observations: 7,467 ## Variables: 66 ## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642,... ## $ city <chr> "None - Rural Crash", "Henderson", "None ... ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", ... ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coast... ## $ development <chr> "Farms, Woods, Pastures", "Residential", ... ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To ... ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BU... ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urba... ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 ... ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Doubl... ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear... ## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No",... ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41",... ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-2... ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No"... ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N... ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Tra... ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury"... ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lan... ## $ bike_race <chr> "Black", "Black", "White", "Black", "Whit... ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female",... ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50... ## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "... ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", ... 
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N... ## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11... ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No ... ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "... ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fe... ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA,... ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No"... ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "1... ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Satu... ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicycli... ## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, ... ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-... ## $ crash_month <chr> "December", "November", "November", "Dece... ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury"... ## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, ... ## $ crash_type <chr> "Motorist Overtaking - Undetected Bicycli... ## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,... ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", ... ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes... ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark -... ## $ road_character <chr> "Straight - Level", "Straight - Level", "... ## $ road_class <chr> "State Secondary Route", "Local Street", ... ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Mo... ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided... ## $ road_defects <chr> "None", NA, "None", "None", "None", "None... ## $ road_feature <chr> "No Special Feature", "T-Intersection", "... ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smoo... ## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 
## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane"... ## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,... ## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "... ## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N... ## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N... ## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.3151... ## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [... ``` --- ## A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[  ] .pull-right[ .midi[ - `filter`: pick rows matching criteria - `slice`: pick rows using index(es) - `select`: pick columns by name - `pull`: grab a column as a vector - `arrange`: reorder rows - `mutate`: add new variables - `distinct`: filter for unique rows - `sample_n` / `sample_frac`: randomly sample rows - `summarise`: reduce variables to values - ... (many more) ] ] --- ## **dplyr** rules for functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ## A note on piping and layering - The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code. - The `+` operator in **ggplot2** functions is used for "layering". 
This means you create the plot in layers, separated by `+`. --- ## `filter` to select a subset of rows for crashes in Durham County ```r ncbikecrash %>% * filter(county == "Durham") ``` ``` ## # A tibble: 340 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban ## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban ## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban ## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban ## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## # ... with 330 more rows, and 58 more variables: speed_limit <chr>, ## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>, ## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>, ## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>, ## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, ## # driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, ## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>, ## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>, ## # crash_day <chr>, crash_group <chr>, crash_hour <int>, ## # crash_location <chr>, crash_month <chr>, crash_severity <chr>, ## # crash_time <time>, crash_type <chr>, crash_year <int>, ## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>, ## # road_character <chr>, road_class <chr>, road_condition <chr>, ## # road_configuration <chr>, road_defects <chr>, road_feature <chr>, ## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi 
<int>, ## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>, ## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>, ## # num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` --- ## `filter` for many conditions at once for crashes in Durham County where biker was 0-5 years old ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") ``` ``` ## # A tibble: 4 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban ## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban ## # ... with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes 
<chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ## Logical operators in R operator | definition || operator | definition ------------|------------------------------||--------------|---------------- `<` | less than ||`x` | `y` | `x` OR `y` `<=` | less than or equal to ||`is.na(x)` | test if `x` is `NA` `>` | greater than ||`!is.na(x)` | test if `x` is not `NA` `>=` | greater than or equal to ||`x %in% y` | test if `x` is in `y` `==` | exactly equal to ||`!(x %in% y)` | test if `x` is not in `y` `!=` | not equal to ||`!x` | not `x` `x & y` | `x` AND `y` || | --- ## `select` to keep variables ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") %>% select(locality, speed_limit) ``` ``` ## # A tibble: 4 x 2 ## locality speed_limit ## <chr> <chr> ## 1 Urban (>70% Developed) 30 - 35 MPH ## 2 Urban (>70% Developed) 5 - 15 MPH ## 3 Urban (>70% Developed) 20 - 25 MPH ## 4 Urban (>70% Developed) 20 - 25 MPH ``` --- ## `select` to exclude variables ```r ncbikecrash %>% select(-object_id) ``` ``` ## # A tibble: 7,467 x 65 ## city county region development locality on_road rural_urban speed_limit ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M… ## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M… ## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M… ## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M… ## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA> ## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M… ## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M… ## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M… ## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # ... 
with 7,457 more rows, and 57 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ## `select` a range of variables ```r ncbikecrash %>% select(city:locality) ``` ``` ## # A tibble: 7,467 x 5 ## city county region development locality ## <chr> <chr> <chr> <chr> <chr> ## 1 None - Rural… Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D… ## 3 None - Rural… Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop… ## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop… ## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop… ## 6 None - Rural… Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop… ## 7 None - Rural… Richmond Piedmo… Residential Mixed (30% To 70% D… ## 8 Raleigh Wake Piedmo… Commercial Urban 
(>70% Develop… ## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop… ## 10 New Bern Craven Coastal Residential Urban (>70% Develop… ## # ... with 7,457 more rows ``` --- ## `slice` for certain row numbers First five ```r ncbikecrash %>% slice(1:5) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural ## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban ## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural ## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban ## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban ## # ... with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## 
# geo_shape <chr> ``` --- ## `slice` for certain row numbers Last five ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban ## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban ## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban ## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban ## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural ## # ... with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, ## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, ## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, ## # geo_shape <chr> ``` --- ## `pull` to extract a column as a vector ```r ncbikecrash %>% slice(1:6) %>% pull(locality) 
``` ``` ## [1] "Rural (<30% Developed)" "Mixed (30% To 70% Developed)" ## [3] "Rural (<30% Developed)" "Urban (>70% Developed)" ## [5] "Urban (>70% Developed)" "Rural (<30% Developed)" ``` vs. ```r ncbikecrash %>% slice(1:6) %>% select(locality) ``` ``` ## # A tibble: 6 x 1 ## locality ## <chr> ## 1 Rural (<30% Developed) ## 2 Mixed (30% To 70% Developed) ## 3 Rural (<30% Developed) ## 4 Urban (>70% Developed) ## 5 Urban (>70% Developed) ## 6 Rural (<30% Developed) ``` --- ## `sample_n` / `sample_frac` for a random sample - `sample_n`: randomly sample 5 observations ```r ncbikecrash_n5 <- ncbikecrash %>% sample_n(5, replace = FALSE) dim(ncbikecrash_n5) ``` ``` ## [1] 5 66 ``` - `sample_frac`: randomly sample 20% of observations ```r ncbikecrash_perc20 <-ncbikecrash %>% sample_frac(0.2, replace = FALSE) dim(ncbikecrash_perc20) ``` ``` ## [1] 1493 66 ``` --- ## `distinct` to filter for unique rows And `arrange` to order alphabetically ```r ncbikecrash %>% select(county, city) %>% distinct() %>% arrange(county, city) ``` ``` ## # A tibble: 391 x 2 ## county city ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon ## 4 Alamance Elon College ## 5 Alamance Gibsonville ## 6 Alamance Graham ## 7 Alamance Green Level ## 8 Alamance Mebane ## 9 Alamance None - Rural Crash ## 10 Alexander None - Rural Crash ## # ... 
with 381 more rows ``` --- ## `summarise` to reduce variables to values ```r ncbikecrash %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 1 x 1 ## avg_hr ## <dbl> ## 1 14.7 ``` --- ## `group_by` to do calculations on groups ```r ncbikecrash %>% group_by(hit_run) %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 2 x 2 ## hit_run avg_hr ## <chr> <dbl> ## 1 No 14.6 ## 2 Yes 15.0 ``` --- ## `count` observations in groups ```r ncbikecrash %>% count(driver_alcohol_drugs) ``` ``` ## # A tibble: 6 x 2 ## driver_alcohol_drugs n ## <chr> <int> ## 1 Missing 99 ## 2 No 695 ## 3 Yes-Alcohol, impairment suspected 12 ## 4 Yes-Alcohol, no impairment detected 3 ## 5 Yes-Drugs, impairment suspected 4 ## 6 <NA> 6654 ``` --- ## `mutate` to add new variables ```r ncbikecrash %>% mutate(driver_alcohol_drugs_simplified = case_when( driver_alcohol_drugs == "Missing" ~ NA_character_, str_detect(driver_alcohol_drugs, "Yes") ~ "Yes", TRUE ~ "No" )) ``` --- ## "Save" when you `mutate` Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame. 
```r
ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
```

---

## Check before you move on

```r
ncbikecrash %>%
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
```

```
## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 Missing                             Missing                            99
## 2 No                                  No                                695
## 3 Yes-Alcohol, impairment suspected   Yes                                12
## 4 Yes-Alcohol, no impairment detected Yes                                 3
## 5 Yes-Drugs, impairment suspected     Yes                                 4
## 6 <NA>                                <NA>                             6654
```

```r
ncbikecrash %>%
  count(driver_alcohol_drugs_simplified)
```

```
## # A tibble: 4 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 Missing                            99
## 2 No                                695
## 3 Yes                                19
## 4 <NA>                             6654
```

---
class: center, middle

# Application exercise

---

## `ae-04-ncbikecrash`

But, not so fast...

Let's first talk a bit about Git and GitHub.

---
class: center, middle

# Git and GitHub

---

## Version control

- We introduced GitHub as a platform for collaboration
- But it's much more than that...
- It's actually designed for version control

---

## Versioning

<img src="img/lego-steps.png" width="1200" style="display: block; margin: auto;" />

---

## Versioning with human-readable messages

<img src="img/lego-steps-commit-messages.png" width="1200" style="display: block; margin: auto;" />

---

## Why do we need version control?

<img src="img/phd_comics_vc.gif" style="display: block; margin: auto;" />

---

# Git and GitHub tips

- Git is a version control system -- like the “Track Changes” feature from Microsoft Word, on steroids. GitHub is the home for your Git-based projects on the internet -- like Dropbox but much, much better.

--

- There are millions of git commands -- ok, that's an exaggeration, but there are a lot of them -- and very few people know them all. 99% of the time you will use git to add, commit, push, and pull.
--

- We will be doing Git things and interfacing with GitHub through RStudio, but if you google for help you might come across methods for doing these things in the command line -- skip that and move on to the next resource unless you feel comfortable trying it out.

--

- There is a great resource for working with git and R: [happygitwithr.com](http://happygitwithr.com/). Some of the content in there is beyond the scope of this course, but it's a good place to look for help.

---

## Application exercise: <br> `ae-04-ncbikecrash`

- Practice with Git and GitHub
- Concepts introduced:
  - Connecting an R project to a GitHub repository
  - Working with a local and remote repository
  - Committing, pushing, and pulling
- There is just a bit more of GitHub that we'll use in this class, but for today this is enough
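
---

## Recap: chaining **dplyr** verbs

A minimal sketch of how the verbs from the data wrangling slides chain together with pipes. The `crashes` tibble is made up for illustration (it is not a sample of `ncbikecrash`), and recoding `"Missing"` to `NA` is just one possible choice, not the only reasonable one.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, invented for this example
crashes <- tibble(
  county = c("Durham", "Durham", "Wake", "Wake", "Orange"),
  crash_hour = c(8L, 17L, 22L, 13L, 6L),
  driver_alcohol_drugs = c(
    "No", "Yes-Alcohol, impairment suspected", "Missing",
    NA, "Yes-Drugs, impairment suspected"
  )
)

# mutate + case_when: collapse the "Yes-..." levels, turn "Missing" into NA,
# and leave everything else (including existing NAs) unchanged
crashes_simplified <- crashes %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))

# Check the recoding before moving on
crashes_simplified %>%
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)

# filter + group_by + summarise: average crash hour per county,
# using only rows where the simplified variable is known
by_county <- crashes_simplified %>%
  filter(!is.na(driver_alcohol_drugs_simplified)) %>%
  group_by(county) %>%
  summarise(avg_hr = mean(crash_hour))

by_county
```

Using `NA_character_` rather than a bare `NA` keeps every right-hand side of `case_when` the same type, which older **dplyr** versions require.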