class: center, middle, inverse, title-slide # Tidy data and data wrangling ### Dr. Çetinkaya-Rundel ### 2018-01-24 --- ## Announcements - Class recordings linked on course page -- I'll email the link as well - Look out for an email about your team tonight, start thinking about a team name asap - --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- ## Summary tables .question[ Is each of the following a dataset or a summary table? ] .small[ .pull-left[
] .pull-right[
] ] --- class: center, middle # Pipes --- ## Where does the name come from? The pipe operator is implemented in the package **magrittr**. .pull-left[  ] .pull-right[  ] --- ## Review: How does a pipe work? - You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park. - Expressed as a set of nested functions in R pseudocode this would look like: ```r park(drive(start_car(find("keys")), to = "campus")) ``` - Writing it out using pipes give it a more natural (and easier to read) structure: ```r find("keys") %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## What about other arguments? To send results to a function argument other than first one or to use the previous result for multiple arguments, use `.`: ```r starwars %>% filter(species == "Human") %>% lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -116.58 1.11 ``` --- class: center, middle # Data wrangling --- ## NC DOT Fatal Crashes in North Carolina From https://opendurham.nc.gov ```r bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", na = c("NA", "", ".")) ``` --- ## Variables View the names of variables via ```r names(bike) ``` ``` ## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location" ``` and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html. --- ## Want to follow along? Go to RStudio Cloud and make a copy of "NC Bike Crashes - Wrangling". --- ## Viewing your data - In the Environment, click on the name of the data frame to view it in the data viewer - Use the `glimpse` function to take a peek ```r glimpse(bike) ``` ``` ## Observations: 5,716 ## Variables: 54 ## $ FID <int> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82,... ## $ OBJECTID <int> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83,... ## $ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No",... ## $ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40... ## $ Bike_Age <int> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20,... ## $ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Bike_Dir <chr> "Not Applicable", "With Traffic", "With Traffic", N... ## $ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury... ## $ Bike_Pos <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "... ## $ Bike_Race <chr> "Black", "Black", "Black", "White", "Black", "White... ## $ Bike_Sex <chr> "Female", "Male", "Male", "Male", "Male", "Male", "... ## $ City <chr> "Durham", "Greenville", "Farmville", "Charlotte", "... ## $ County <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenbu... ## $ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ CrashDay <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA,... ## $ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 20... ## $ Crash_Grp <chr> "Bicyclist Failed to Yield - Midblock", "Crossing P... ## $ Crash_Hour <int> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, ... ## $ Crash_Loc <chr> "Non-Intersection", "Intersection-Related", "Inters... ## $ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01... ## $ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-... ## $ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Cross... ## $ Crash_Ty_1 <int> 353311, 211180, 111144, 119139, 112114, 311231, 119... ## $ Crash_Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200... ## $ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No ... ## $ Developmen <chr> "Residential", "Commercial", "Residential", "Reside... ## $ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "4... ## $ Drvr_Age <int> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50,... ## $ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Mis... ## $ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "... ## $ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O:... ## $ Drvr_Race <chr> "Black", "Black", "White", "White", "/Missing", "Wh... ## $ Drvr_Sex <chr> "Male", "Male", "Female", "Female", NA, "Female", "... ## $ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport ... ## $ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Hit_Run <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "... ## $ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "... ## $ Locality <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Develo... ## $ Num_Lanes <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lane... ## $ Num_Units <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ... ## $ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight -... ## $ Rd_Class <chr> "Local Street", "Local Street", "Local Street", "NC... ## $ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "D... ## $ Rd_Config <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotec... ## $ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "No... ## $ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Fou... ## $ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt... ## $ Region <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Pied... ## $ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban... ## $ Speed_Limi <chr> "20 - 25 MPH", "40 - 45 MPH", "30 - 35 MPH", "40... ## $ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop S... ## $ Weather <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clea... ## $ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Location <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35.... ``` --- ## A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. Single data frame functions / verbs: * `filter`: pick rows matching criteria * `slice`: pick rows using index(es) * `select`: pick columns by name * `pull`: grab a column as a vector * `rename`: rename specific columns * `arrange`: reorder rows * `mutate`: add new variables * `transmute`: create new data frame with variables * `distinct`: filter for unique rows * `sample_n` / `sample_frac`: randomly sample rows * `summarise`: reduce variables to values * ... (many more) --- ## **dplyr** rules for functions 1. First argument is *always* a data frame 2. Subsequent arguments say what to do with that data frame 3. Always return a data frame 4. Don't modify in place 5. Performance via lazy evaluation --- ## A note on piping and layering - The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code. - The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`. --- ## Filter rows with `filter` - Select a subset of rows in a data frame. - Easily filter for many conditions at once. --- ## `filter` for crashes in Durham County ```r bike %>% filter(County == "Durham") ``` ``` ## # A tibble: 253 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 53 54 Yes 50-59 52 No With Traffic ## 3 56 57 Yes 16-19 18 No <NA> ## 4 209 210 No 16-19 16 No Facing Traffic ## 5 228 229 Yes 40-49 40 No With Traffic ## 6 620 621 Yes 50-59 55 No With Traffic ## 7 667 668 Yes 60-69 61 No Not Applicable ## 8 458 459 Yes 60-69 62 No With Traffic ## 9 576 577 No 40-49 49 No With Traffic ## 10 618 619 No 20-24 23 No With Traffic ## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr> ``` --- ## `filter` for crashes in Durham County where biker was < 10 yrs old ```r bike %>% filter(County == "Durham", Bike_Age < 10) ``` ``` ## # A tibble: 20 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 47 48 No 10-Jun 9 No Not Applicable ## 3 124 125 Yes 10-Jun 8 No With Traffic ## 4 531 532 Yes 10-Jun 7 No With Traffic ## 5 704 705 Yes 10-Jun 9 No Not Applicable ## 6 42 43 No 10-Jun 8 No With Traffic ## 7 392 393 Yes 0-5 2 No Not Applicable ## 8 941 942 No 10-Jun 9 No With Traffic ## 9 436 437 Yes 10-Jun 6 No Not Applicable ## 10 160 161 Yes 10-Jun 7 No With Traffic ## 11 273 274 Yes 10-Jun 7 No Facing Traffic ## 12 78 79 Yes 10-Jun 7 No With Traffic ## 13 422 423 No 10-Jun 9 No Not Applicable ## 14 570 571 No <NA> 0 Missing Not Applicable ## 15 683 684 Yes 10-Jun 8 No Not Applicable ## 16 62 63 Yes 10-Jun 7 No With Traffic ## 17 248 249 No 0-5 4 No Not Applicable ## 18 306 307 Yes 10-Jun 8 No With Traffic ## 19 231 232 Yes 10-Jun 8 No With Traffic ## 20 361 362 Yes 10-Jun 9 No With Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr> ``` --- ## Commonly used logical operators in R operator | definition || operator | definition ------------|------------------------------||--------------|---------------- `<` | less than ||`x` | `y` | `x` OR `y` `<=` | less than or equal to ||`is.na(x)` | test if `x` is `NA` `>` | greater than ||`!is.na(x)` | test if `x` is not `NA` `>=` | greater than or equal to ||`x %in% y` | test if `x` is in `y` `==` | exactly equal to ||`!(x %in% y)` | test if `x` is not in `y` `!=` | not equal to ||`!x` | not `x` `x & y` | `x` AND `y` || | --- ## Aside: real data is messy! .question[ What in the world does a `BikeAge_gr` of `10-Jun` or `15-Nov` mean? ] .small[ ```r bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) ``` ``` ## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 10-Jun 421 ## 3 15-Nov 747 ## 4 16-19 605 ## 5 20-24 680 ## 6 25-29 430 ## 7 30-39 658 ## 8 40-49 920 ## 9 50-59 739 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112 ``` ] --- ## Careful data scientists clean up their data first! - We're going to need to do some text parsing to clean up these data + `10-Jun` should be `6-10` + `15-Nov` should be `11-15` --- ## Correct and overwrite `mutate` - Remember we want to do the following in the `BikeAge_Gr` variable + `10-Jun` should be `6-10` + `15-Nov` should be `11-15` ```r bike <- bike %>% mutate(BikeAge_Gr = case_when( BikeAge_Gr == "10-Jun" ~ "6-10", BikeAge_Gr == "15-Nov" ~ "11-15", TRUE ~ BikeAge_Gr # everything else )) ``` - Note that we're overwriting existing data and columns, so be careful! + But remember, it's easy to revert if you make a mistake since we didn't touch the raw data, we can always reload it and start over --- ## Check before you move on Always check your changes and confirm code did what you wanted it to do ```r bike %>% group_by(BikeAge_Gr) %>% summarise(count = n()) ``` ``` ## # A tibble: 13 x 2 ## BikeAge_Gr count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112 ``` --- ## `mutate` to add new variables .question[ How is the new `alcohol` variable determined? ] ```r bike %>% mutate(alcohol = case_when( Bike_Alc_D == "No" & Drvr_Alc_D == "No" ~ "No", Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes" ~ "Yes", Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing", Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing" )) ``` --- ## "Save" when you `mutate` Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame. ```r bike <- bike %>% mutate(alcohol = case_when( Bike_Alc_D == "No" & Drvr_Alc_D == "No" ~ "No", Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes" ~ "Yes", Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing", Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing" )) ``` --- ## `transmute` to create a new dataset You'll use this much less often than `mutate` but when you need it, you need it. ```r bike %>% transmute(ID = paste(FID, OBJECTID, sep = "-")) ``` ``` ## # A tibble: 5,716 x 1 ## ID ## <chr> ## 1 18-19 ## 2 29-30 ## 3 33-34 ## 4 35-36 ## 5 49-50 ## 6 53-54 ## 7 56-57 ## 8 60-61 ## 9 63-64 ## 10 66-67 ## # ... with 5,706 more rows ``` --- ## `slice` for certain row numbers First five ```r bike %>% slice(1:5) ``` ``` ## # A tibble: 5 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 29 30 Yes 50-59 51 No With Traffic ## 3 33 34 No <NA> 10 No With Traffic ## 4 35 36 Yes 16-19 17 No <NA> ## 5 49 50 No <NA> 6 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr> ``` --- ## `slice` for certain row numbers Last five ```r last_row <- nrow(bike) bike %>% slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 460 461 Yes 6-10 7 No Not Applicable ## 2 474 475 Yes 50-59 50 No With Traffic ## 3 479 480 Yes 16-19 16 No Not Applicable ## 4 487 488 No 40-49 47 Yes With Traffic ## 5 488 489 Yes 30-39 35 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr> ``` --- ## `select` to keep only the variables you mention ```r bike %>% select(Crash_Loc, Hit_Run) %>% table() ``` ``` ## Hit_Run ## Crash_Loc No Yes ## Intersection 2223 275 ## Intersection-Related 252 42 ## Location 3 7 ## Non-Intersection 2213 462 ## Non-Roadway 205 30 ``` --- ## or `select` to exclude variables ```r bike %>% select(-OBJECTID) ``` ``` ## # A tibble: 5,716 x 53 ## FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir Bike_Injur ## <int> <chr> <chr> <int> <chr> <chr> <chr> ## 1 18 No <NA> 6 No Not Appli… C: Possible… ## 2 29 Yes 50-59 51 No With Traf… C: Possible… ## 3 33 No <NA> 10 No With Traf… Injury ## 4 35 Yes 16-19 17 No <NA> B: Evident … ## 5 49 No <NA> 6 No Facing Tr… O: No Injury ## 6 53 Yes 50-59 52 No With Traf… A: Disablin… ## 7 56 Yes 16-19 18 No <NA> C: Possible… ## 8 60 No 40-49 40 No Facing Tr… B: Evident … ## 9 63 Yes 6-10 6 No Facing Tr… B: Evident … ## 10 66 Yes 6-10 7 No <NA> B: Evident … ## # ... with 5,706 more rows, and 46 more variables: Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr> ``` --- ## or `select` a range of variables ```r bike %>% select(OBJECTID:Bike_Injur) ``` ``` ## # A tibble: 5,716 x 7 ## OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir Bike_Injur ## <int> <chr> <chr> <int> <chr> <chr> <chr> ## 1 19 No <NA> 6 No Not Appl… C: Possib… ## 2 30 Yes 50-59 51 No With Tra… C: Possib… ## 3 34 No <NA> 10 No With Tra… Injury ## 4 36 Yes 16-19 17 No <NA> B: Eviden… ## 5 50 No <NA> 6 No Facing T… O: No Inj… ## 6 54 Yes 50-59 52 No With Tra… A: Disabl… ## 7 57 Yes 16-19 18 No <NA> C: Possib… ## 8 61 No 40-49 40 No Facing T… B: Eviden… ## 9 64 Yes 6-10 6 No Facing T… B: Eviden… ## 10 67 Yes 6-10 7 No <NA> B: Eviden… ## # ... with 5,706 more rows ``` --- ## `pull` to extract a column as a vector ```r bike %>% slice(1:6) %>% pull(Location) ``` ``` ## [1] "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" ## [4] "35.076767, -80.7728" "35.19999, -80.75713" "35.966644, -78.96749" ``` ```r bike %>% slice(1:6) %>% select(Location) ``` ``` ## # A tibble: 6 x 1 ## Location ## <chr> ## 1 36.002743, -78.8785 ## 2 35.612984, -77.39265 ## 3 35.595676, -77.59074 ## 4 35.076767, -80.7728 ## 5 35.19999, -80.75713 ## 6 35.966644, -78.96749 ``` --- ## The two `pull`s in your lives .pull-left[  ] .pull-right[  ] - Don't get `pull` happy when wrangling data! Only extract out variables if you truly need to, otherwise keep in data frame. - But always ⬇️ `Pull` before starting your work when collaborating on GitHub. --- ## `rename` specific columns Useful for correcting typos, and renaming to make variable names shorter and/or more informative - Original names: .small[ ```r names(bike) ``` ``` ## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location" ``` ] - Rename `Speed_Limi` to `Speed_Limit`: ```r bike <- bike %>% rename(Speed_Limit = Speed_Limi) ``` --- ## Check before you move on Always check your changes and confirm code did what you wanted it to do .small[ ```r names(bike) ``` ``` ## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limit" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location" ``` ] --- ## `summarise` to reduce variables to values The values are summarised in a data frame ```r bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) ``` ``` ## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112 ``` --- ## and `arrange` to order rows ```r bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(desc(crash_count)) ``` ``` ## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 40-49 920 ## 2 11-15 747 ## 3 50-59 739 ## 4 20-24 680 ## 5 30-39 658 ## 6 16-19 605 ## 7 25-29 430 ## 8 6-10 421 ## 9 60-69 274 ## 10 <NA> 112 ## 11 0-5 60 ## 12 70+ 58 ## 13 70 12 ``` --- ## `count` to group by then count .small[ ```r bike %>% count(BikeAge_Gr) ``` ``` ## # A tibble: 13 x 2 ## BikeAge_Gr n ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112 ``` ] .question[ If you wanted to arrange these in ascending order what would you add to the pipe? ] --- ## Select rows with `sample_n` or `sample_frac` - `sample_n`: randomly sample 5 observations ```r bike_n5 <- bike %>% sample_n(5, replace = FALSE) dim(bike_n5) ``` ``` ## [1] 5 54 ``` - `sample_frac`: randomly sample 20% of observations ```r bike_perc20 <-bike %>% sample_frac(0.2, replace = FALSE) dim(bike_perc20) ``` ``` ## [1] 1143 54 ``` --- ## `distinct` to filter for unique rows ```r bike %>% select(County, City) %>% distinct() %>% arrange(County, City) ``` ``` ## # A tibble: 360 x 2 ## County City ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon College ## 4 Alamance Gibsonville ## 5 Alamance Graham ## 6 Alamance Green Level ## 7 Alamance Mebane ## 8 Alamance None - Rural Crash ## 9 Alexander None - Rural Crash ## 10 Alleghany None - Rural Crash ## # ... with 350 more rows ```