class: center, middle, inverse, title-slide # Data Wrangling ### Dr. Maria Tackett ### 01.28.19 --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Announcements - Writing Ex 1 Revision - due **today at 11:59p** - Think of team name if you haven't already --- class: center, middle # Data Wrangling --- ## Bike crashes in NC 2007 - 2014 The dataset is in the **dsbox** package: ```r library(dsbox) ncbikecrash ``` - The dataset contains all North Carolina bike crash data from 2007-2014. - Data downloaded on Sep 6, 2018. --- ### A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[  ] .pull-right[ - `filter`: pick rows matching criteria - `slice`: pick rows using index(es) - `select`: pick columns by name - `pull`: grab a column as a vector - `arrange`: reorder rows ] --- ### A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[  ] .pull-right[ - `mutate`: add new variables - `distinct`: filter for unique rows - `sample_n` / `sample_frac`: randomly sample rows - `summarise`: reduce variables to values - ... (many more) ] --- ## **dplyr** rules for functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ### `filter` to select rows - **One condition** - for crashes in Durham County ```r ncbikecrash %>% filter(county == "Durham") ``` -- - **Many conditions** - for crashes in Durham County where biker was 0-5 years old ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") ``` --- ### `select` to select or exclude variables - select variables (columns) ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") %>% select(locality, speed_limit) ``` -- - exclude variables ```r ncbikecrash %>% select(-object_id) ``` -- - range of variables ```r ncbikecrash %>% select(city:locality) ``` --- ### `slice` for certain row numbers - First five ```r ncbikecrash %>% slice(1:5) ``` -- - Last five ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% slice((last_row - 4):last_row) ``` --- ### `pull` to extract a column as a vector .small[ ```r ncbikecrash %>% slice(1:6) %>% pull(locality) ``` ``` ## [1] "Rural (<30% Developed)" "Mixed (30% To 70% Developed)" ## [3] "Rural (<30% Developed)" "Urban (>70% Developed)" ## [5] "Urban (>70% Developed)" "Rural (<30% Developed)" ``` ] vs. .small[ ```r ncbikecrash %>% slice(1:6) %>% select(locality) ``` ``` ## # A tibble: 6 x 1 ## locality ## <chr> ## 1 Rural (<30% Developed) ## 2 Mixed (30% To 70% Developed) ## 3 Rural (<30% Developed) ## 4 Urban (>70% Developed) ## 5 Urban (>70% Developed) ## 6 Rural (<30% Developed) ``` ] --- ### `sample_n` / `sample_frac` for a random sample - `sample_n`: randomly sample 5 observations .small[ ```r ncbikecrash_n5 <- ncbikecrash %>% sample_n(5, replace = FALSE) dim(ncbikecrash_n5) ``` ``` ## [1] 5 66 ``` ] - `sample_frac`: randomly sample 20% of observations .small[ ```r ncbikecrash_perc20 <-ncbikecrash %>% sample_frac(0.2, replace = FALSE) dim(ncbikecrash_perc20) ``` ``` ## [1] 1493 66 ``` ] --- ### `distinct` to filter for unique rows And `arrange` to order alphabetically ```r ncbikecrash %>% select(county, city) %>% distinct() %>% arrange(county, city) ``` ``` ## # A tibble: 391 x 2 ## county city ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon ## 4 Alamance Elon College ## 5 Alamance Gibsonville ## 6 Alamance Graham ## 7 Alamance Green Level ## 8 Alamance Mebane ## 9 Alamance None - Rural Crash ## 10 Alexander None - Rural Crash ## # … with 381 more rows ``` --- ### `summarise` to reduce variables to values ```r ncbikecrash %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 1 x 1 ## avg_hr ## <dbl> ## 1 14.7 ``` --- ### `group_by` to do calculations on groups ```r ncbikecrash %>% group_by(hit_run) %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 2 x 2 ## hit_run avg_hr ## <chr> <dbl> ## 1 No 14.6 ## 2 Yes 15.0 ``` --- ### `count` observations in groups ```r ncbikecrash %>% count(driver_alcohol_drugs) ``` ``` ## # A tibble: 6 x 2 ## driver_alcohol_drugs n ## <chr> <int> ## 1 Missing 99 ## 2 No 695 ## 3 Yes-Alcohol, impairment suspected 12 ## 4 Yes-Alcohol, no impairment detected 3 ## 5 Yes-Drugs, impairment suspected 4 ## 6 <NA> 6654 ``` --- ### `mutate` to add new variables .midi[ ```r ncbikecrash %>% mutate(driver_alcohol_drugs_simplified = case_when( driver_alcohol_drugs == "Missing" ~ NA_character_, str_detect(driver_alcohol_drugs, "Yes") ~ "Yes", TRUE ~ driver_alcohol_drugs )) ``` ] --- ### "Save" when you `mutate` Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame. ```r ncbikecrash <- ncbikecrash %>% mutate(driver_alcohol_drugs_simplified = case_when( driver_alcohol_drugs == "Missing" ~ NA_character_, str_detect(driver_alcohol_drugs, "Yes") ~ "Yes", TRUE ~ driver_alcohol_drugs )) ``` --- ### Check before you move on .small[ ```r ncbikecrash %>% count(driver_alcohol_drugs, driver_alcohol_drugs_simplified) ``` ``` ## # A tibble: 6 x 3 ## driver_alcohol_drugs driver_alcohol_drugs_simplified n ## <chr> <chr> <int> ## 1 Missing <NA> 99 ## 2 No No 695 ## 3 Yes-Alcohol, impairment suspected Yes 12 ## 4 Yes-Alcohol, no impairment detected Yes 3 ## 5 Yes-Drugs, impairment suspected Yes 4 ## 6 <NA> <NA> 6654 ``` ] .small[ ```r ncbikecrash %>% count(driver_alcohol_drugs_simplified) ``` ``` ## # A tibble: 3 x 2 ## driver_alcohol_drugs_simplified n ## <chr> <int> ## 1 No 695 ## 2 Yes 19 ## 3 <NA> 6753 ``` ] --- ## AE 04 - NC bike crashes - Copy the NC Bike Crashes project on RStudio Cloud - For each question you work on, set the `eval` chunk option to `TRUE` and knit