Tidy data and data wrangling

# Tidy data and data wrangling
### Dr. Çetinkaya-Rundel
### 2018-01-24

---

## Announcements

- Class recordings linked on course page -- I'll email the link as well
- Look out for an email about your team tonight, start thinking about a team name asap
-

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

## Summary tables

.small[
.pull-left[
<div id="htmlwidget-241424ffe8df408cc6d7" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-241424ffe8df408cc6d7">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87"],["Luke Skywalker","C-3PO","R2-D2","Darth Vader","Leia Organa","Owen Lars","Beru Whitesun lars","R5-D4","Biggs Darklighter","Obi-Wan Kenobi","Anakin Skywalker","Wilhuff Tarkin","Chewbacca","Han Solo","Greedo","Jabba Desilijic Tiure","Wedge Antilles","Jek Tono Porkins","Yoda","Palpatine","Boba Fett","IG-88","Bossk","Lando Calrissian","Lobot","Ackbar","Mon Mothma","Arvel Crynyd","Wicket Systri Warrick","Nien Nunb","Qui-Gon Jinn","Nute Gunray","Finis Valorum","Jar Jar Binks","Roos Tarpals","Rugor Nass","Ric Olié","Watto","Sebulba","Quarsh Panaka","Shmi Skywalker","Darth Maul","Bib Fortuna","Ayla Secura","Dud Bolt","Gasgano","Ben Quadinaros","Mace Windu","Ki-Adi-Mundi","Kit Fisto","Eeth Koth","Adi Gallia","Saesee Tiin","Yarael Poof","Plo Koon","Mas Amedda","Gregar Typho","Cordé","Cliegg Lars","Poggle the Lesser","Luminara Unduli","Barriss Offee","Dormé","Dooku","Bail Prestor Organa","Jango Fett","Zam Wesell","Dexter Jettster","Lama Su","Taun We","Jocasta Nu","Ratts Tyerell","R4-P17","Wat Tambor","San Hill","Shaak Ti","Grievous","Tarfful","Raymus Antilles","Sly Moore","Tion Medon","Finn","Rey","Poe Dameron","BB8","Captain Phasma","Padmé Amidala"],[172,167,96,202,150,178,165,97,183,182,188,180,228,180,173,175,170,180,66,170,183,200,190,177,175,180,150,null,88,160,193,191,170,196,224,206,183,137,112,183,163,175,180,178,94,122,163,188,198,196,171,184,188,264,188,196,185,157,183,183,170,166,165,193,191,183,168,198,229,213,167,79,96,193,191,178,216,234,188,178,206,null,null,null,null,null,165],[77,75,32,136,49,120,75,32,84,77,84,null,112,80,74,1358,77,110,17,75,78.2,140,113,79,79,83,null,null,20,68,89,90,null,66,82,null,null,null,40,null,null,80,null,55,45,null,65,84,82,87,null,50,null,null,80,null,85,null,null,80,56.2,50,null,80,null,79,55,102,88,null,null,15,null,48,null,57,159,136,79,48,80,null,null,null,null,null,45]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>name<\/th>\n      <th>height<\/th>\n      <th>mass<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"pt","pageLength":8,"columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[8,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>
]
.pull-right[
<div id="htmlwidget-9b3bf55d4896708c26da" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-9b3bf55d4896708c26da">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49"],["Alderaan","Aleen Minor","Bespin","Bestine IV","Cato Neimoidia","Cerea","Champala","Chandrila","Concord Dawn","Corellia","Coruscant","Dathomir","Dorin","Endor","Eriadu","Geonosis","Glee Anselm","Haruun Kal","Iktotch","Iridonia","Kalee","Kamino","Kashyyyk","Malastare","Mirial","Mon Cala","Muunilinst","Naboo","Nal Hutta","Ojom","Quermia","Rodia","Ryloth","Serenno","Shili","Skako","Socorro","Stewjon","Sullust","Tatooine","Toydaria","Trandosha","Troiken","Tund","Umbara","Utapau","Vulpter","Zolan",null],[1,null,null,null,null,null,null,1,null,null,2,null,null,null,null,null,null,null,null,null,null,1,null,null,2,null,null,3,null,null,null,null,1,null,1,null,null,null,null,2,null,null,null,null,1,null,null,1,3],[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,1,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],[2,1,1,1,1,1,1,null,1,2,1,1,1,1,1,1,1,1,1,1,1,2,2,1,null,1,1,7,null,1,1,1,1,1,null,1,1,1,1,6,1,1,1,1,null,1,1,null,5],[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,2],[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,1,null,null,null,null,null,null,null,null,null,null,null,2,null,null,null,null,null,null,null,null,null]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>homeworld<\/th>\n      <th>female<\/th>\n      <th>hermaphrodite<\/th>\n      <th>male<\/th>\n      <th>none<\/th>\n      <th><NA><\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"pt","pageLength":8,"columnDefs":[{"className":"dt-right","targets":[2,3,4,5,6]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[8,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>
]
]

---

# Pipes

---

## Where does the name come from?

The pipe operator is implemented in the package **magrittr**.

---

## Review: How does a pipe work?

- You can think about the following sequence of actions - find key, 
unlock car, start car, drive to school, park.

- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes give it a more natural (and easier to read) 
structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

## What about other arguments?

To send results to a function argument other than first one or to use the previous result for multiple arguments, use `.`:

```r
starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11
```

---

# Data wrangling

---

## NC DOT Fatal Crashes in North Carolina

From https://opendurham.nc.gov

```r
bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", 
                  na = c("NA", "", "."))
```

---

## Variables

View the names of variables via

```r
names(bike)
```

```
##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"
```

and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.

---

## Want to follow along?

Go to RStudio Cloud and make a copy of "NC Bike Crashes - Wrangling".

---

## Viewing your data

- In the Environment, click on the name of the data frame to view
it in the data viewer

- Use the `glimpse` function to take a peek

```r
glimpse(bike)
```

```
## Observations: 5,716
## Variables: 54
## $ FID        <int> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82,...
## $ OBJECTID   <int> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83,...
## $ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No",...
## $ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40...
## $ Bike_Age   <int> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20,...
## $ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Bike_Dir   <chr> "Not Applicable", "With Traffic", "With Traffic", N...
## $ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury...
## $ Bike_Pos   <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "...
## $ Bike_Race  <chr> "Black", "Black", "Black", "White", "Black", "White...
## $ Bike_Sex   <chr> "Female", "Male", "Male", "Male", "Male", "Male", "...
## $ City       <chr> "Durham", "Greenville", "Farmville", "Charlotte", "...
## $ County     <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenbu...
## $ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ CrashDay   <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA,...
## $ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 20...
## $ Crash_Grp  <chr> "Bicyclist Failed to Yield - Midblock", "Crossing P...
## $ Crash_Hour <int> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, ...
## $ Crash_Loc  <chr> "Non-Intersection", "Intersection-Related", "Inters...
## $ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01...
## $ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-...
## $ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Cross...
## $ Crash_Ty_1 <int> 353311, 211180, 111144, 119139, 112114, 311231, 119...
## $ Crash_Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200...
## $ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No ...
## $ Developmen <chr> "Residential", "Commercial", "Residential", "Reside...
## $ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "4...
## $ Drvr_Age   <int> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50,...
## $ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Mis...
## $ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "...
## $ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O:...
## $ Drvr_Race  <chr> "Black", "Black", "White", "White", "/Missing", "Wh...
## $ Drvr_Sex   <chr> "Male", "Male", "Female", "Female", NA, "Female", "...
## $ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport ...
## $ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Hit_Run    <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "...
## $ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "...
## $ Locality   <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Develo...
## $ Num_Lanes  <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lane...
## $ Num_Units  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight -...
## $ Rd_Class   <chr> "Local Street", "Local Street", "Local Street", "NC...
## $ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "D...
## $ Rd_Config  <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotec...
## $ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "No...
## $ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Fou...
## $ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt...
## $ Region     <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Pied...
## $ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban...
## $ Speed_Limi <chr> "20 - 25  MPH", "40 - 45  MPH", "30 - 35  MPH", "40...
## $ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop S...
## $ Weather    <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clea...
## $ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Location   <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35....
```

---

## A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

Single data frame functions / verbs:

* `filter`: pick rows matching criteria
* `slice`: pick rows using index(es)
* `select`: pick columns by name
* `pull`: grab a column as a vector
* `rename`: rename specific columns
* `arrange`: reorder rows
* `mutate`: add new variables
* `transmute`: create new data frame with variables
* `distinct`: filter for unique rows
* `sample_n` / `sample_frac`: randomly sample rows
* `summarise`: reduce variables to values
* ... (many more)

---

## **dplyr** rules for functions

1. First argument is *always* a data frame

2. Subsequent arguments say what to do with that data frame

3. Always return a data frame

4. Don't modify in place

5. Performance via lazy evaluation

---

## A note on piping and layering

- The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.

- The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`.

---

## Filter rows with `filter`

- Select a subset of rows in a data frame.
- Easily filter for many conditions at once.

---

## `filter`

for crashes in Durham County

```r
bike %>%
  filter(County == "Durham")
```

```
## # A tibble: 253 x 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir      
##    <int>    <int> <chr>      <chr>         <int> <chr>      <chr>         
##  1    18       19 No         <NA>              6 No         Not Applicable
##  2    53       54 Yes        50-59            52 No         With Traffic  
##  3    56       57 Yes        16-19            18 No         <NA>          
##  4   209      210 No         16-19            16 No         Facing Traffic
##  5   228      229 Yes        40-49            40 No         With Traffic  
##  6   620      621 Yes        50-59            55 No         With Traffic  
##  7   667      668 Yes        60-69            61 No         Not Applicable
##  8   458      459 Yes        60-69            62 No         With Traffic  
##  9   576      577 No         40-49            49 No         With Traffic  
## 10   618      619 No         20-24            23 No         With Traffic  
## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>,
## #   Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
## #   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
## #   Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
```

---

## `filter`

for crashes in Durham County where biker was < 10 yrs old

```r
bike %>%
  filter(County == "Durham", Bike_Age < 10)
```

```
## # A tibble: 20 x 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir      
##    <int>    <int> <chr>      <chr>         <int> <chr>      <chr>         
##  1    18       19 No         <NA>              6 No         Not Applicable
##  2    47       48 No         10-Jun            9 No         Not Applicable
##  3   124      125 Yes        10-Jun            8 No         With Traffic  
##  4   531      532 Yes        10-Jun            7 No         With Traffic  
##  5   704      705 Yes        10-Jun            9 No         Not Applicable
##  6    42       43 No         10-Jun            8 No         With Traffic  
##  7   392      393 Yes        0-5               2 No         Not Applicable
##  8   941      942 No         10-Jun            9 No         With Traffic  
##  9   436      437 Yes        10-Jun            6 No         Not Applicable
## 10   160      161 Yes        10-Jun            7 No         With Traffic  
## 11   273      274 Yes        10-Jun            7 No         Facing Traffic
## 12    78       79 Yes        10-Jun            7 No         With Traffic  
## 13   422      423 No         10-Jun            9 No         Not Applicable
## 14   570      571 No         <NA>              0 Missing    Not Applicable
## 15   683      684 Yes        10-Jun            8 No         Not Applicable
## 16    62       63 Yes        10-Jun            7 No         With Traffic  
## 17   248      249 No         0-5               4 No         Not Applicable
## 18   306      307 Yes        10-Jun            8 No         With Traffic  
## 19   231      232 Yes        10-Jun            8 No         With Traffic  
## 20   361      362 Yes        10-Jun            9 No         With Traffic  
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
```

---

## Commonly used logical operators in R

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |

---

## Aside: real data is messy!

```r
bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())
```

```
## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##    <chr>            <int>
##  1 0-5                 60
##  2 10-Jun             421
##  3 15-Nov             747
##  4 16-19              605
##  5 20-24              680
##  6 25-29              430
##  7 30-39              658
##  8 40-49              920
##  9 50-59              739
## 10 60-69              274
## 11 70                  12
## 12 70+                 58
## 13 <NA>               112
```
]

---

## Careful data scientists clean up their data first!

- We're going to need to do some text parsing to clean up
these data
    + `10-Jun` should be `6-10`
    + `15-Nov` should be `11-15`

---

## Correct and overwrite `mutate`

- Remember we want to do the following in the `BikeAge_Gr` variable
    + `10-Jun` should be `6-10`
    + `15-Nov` should be `11-15`

```r
bike <- bike %>%
  mutate(BikeAge_Gr = case_when(
    BikeAge_Gr == "10-Jun" ~ "6-10",
    BikeAge_Gr == "15-Nov" ~ "11-15",
    TRUE                   ~ BikeAge_Gr     # everything else
  ))
```

- Note that we're overwriting existing data and columns, so be careful!
    + But remember, it's easy to revert if you make a mistake since we didn't
    touch the raw data, we can always reload it and start over

---

## Check before you move on

Always check your changes and confirm code did what you wanted it to do

```r
bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(count = n())
```

```
## # A tibble: 13 x 2
##    BikeAge_Gr count
##    <chr>      <int>
##  1 0-5           60
##  2 11-15        747
##  3 16-19        605
##  4 20-24        680
##  5 25-29        430
##  6 30-39        658
##  7 40-49        920
##  8 50-59        739
##  9 6-10         421
## 10 60-69        274
## 11 70            12
## 12 70+           58
## 13 <NA>         112
```

---

## `mutate` to add new variables

```r
bike %>%
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))
```

---

## "Save" when you `mutate`

Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame.

```r
bike <- bike %>%
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))
```

---

## `transmute` to create a new dataset

You'll use this much less often than `mutate` but when you need it, you need it.

```r
bike %>% 
  transmute(ID = paste(FID, OBJECTID, sep = "-"))
```

```
## # A tibble: 5,716 x 1
##    ID   
##    <chr>
##  1 18-19
##  2 29-30
##  3 33-34
##  4 35-36
##  5 49-50
##  6 53-54
##  7 56-57
##  8 60-61
##  9 63-64
## 10 66-67
## # ... with 5,706 more rows
```

---

## `slice` for certain row numbers

First five

```r
bike %>%
  slice(1:5)
```

```
## # A tibble: 5 x 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir      
##   <int>    <int> <chr>      <chr>         <int> <chr>      <chr>         
## 1    18       19 No         <NA>              6 No         Not Applicable
## 2    29       30 Yes        50-59            51 No         With Traffic  
## 3    33       34 No         <NA>             10 No         With Traffic  
## 4    35       36 Yes        16-19            17 No         <NA>          
## 5    49       50 No         <NA>              6 No         Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
```

---

## `slice` for certain row numbers

Last five

```r
last_row <- nrow(bike)
bike %>%
  slice((last_row - 4):last_row)
```

```
## # A tibble: 5 x 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir      
##   <int>    <int> <chr>      <chr>         <int> <chr>      <chr>         
## 1   460      461 Yes        6-10              7 No         Not Applicable
## 2   474      475 Yes        50-59            50 No         With Traffic  
## 3   479      480 Yes        16-19            16 No         Not Applicable
## 4   487      488 No         40-49            47 Yes        With Traffic  
## 5   488      489 Yes        30-39            35 No         Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
```

---

## `select` to keep only the variables you mention

```r
bike %>%
  select(Crash_Loc, Hit_Run) %>%
  table()
```

```
##                       Hit_Run
## Crash_Loc                No  Yes
##   Intersection         2223  275
##   Intersection-Related  252   42
##   Location                3    7
##   Non-Intersection     2213  462
##   Non-Roadway           205   30
```

---

## or `select` to exclude variables

```r
bike %>%
  select(-OBJECTID)
```

```
## # A tibble: 5,716 x 53
##      FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir   Bike_Injur  
##    <int> <chr>      <chr>         <int> <chr>      <chr>      <chr>       
##  1    18 No         <NA>              6 No         Not Appli… C: Possible…
##  2    29 Yes        50-59            51 No         With Traf… C: Possible…
##  3    33 No         <NA>             10 No         With Traf… Injury      
##  4    35 Yes        16-19            17 No         <NA>       B: Evident …
##  5    49 No         <NA>              6 No         Facing Tr… O: No Injury
##  6    53 Yes        50-59            52 No         With Traf… A: Disablin…
##  7    56 Yes        16-19            18 No         <NA>       C: Possible…
##  8    60 No         40-49            40 No         Facing Tr… B: Evident …
##  9    63 Yes        6-10              6 No         Facing Tr… B: Evident …
## 10    66 Yes        6-10              7 No         <NA>       B: Evident …
## # ... with 5,706 more rows, and 46 more variables: Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
```

---

## or `select` a range of variables

```r
bike %>%
  select(OBJECTID:Bike_Injur)
```

```
## # A tibble: 5,716 x 7
##    OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir  Bike_Injur
##       <int> <chr>      <chr>         <int> <chr>      <chr>     <chr>     
##  1       19 No         <NA>              6 No         Not Appl… C: Possib…
##  2       30 Yes        50-59            51 No         With Tra… C: Possib…
##  3       34 No         <NA>             10 No         With Tra… Injury    
##  4       36 Yes        16-19            17 No         <NA>      B: Eviden…
##  5       50 No         <NA>              6 No         Facing T… O: No Inj…
##  6       54 Yes        50-59            52 No         With Tra… A: Disabl…
##  7       57 Yes        16-19            18 No         <NA>      C: Possib…
##  8       61 No         40-49            40 No         Facing T… B: Eviden…
##  9       64 Yes        6-10              6 No         Facing T… B: Eviden…
## 10       67 Yes        6-10              7 No         <NA>      B: Eviden…
## # ... with 5,706 more rows
```

---

## `pull` to extract a column as a vector

```r
bike %>%
  slice(1:6) %>%
  pull(Location)
```

```
## [1] "36.002743, -78.8785"  "35.612984, -77.39265" "35.595676, -77.59074"
## [4] "35.076767, -80.7728"  "35.19999, -80.75713"  "35.966644, -78.96749"
```

```r
bike %>%
  slice(1:6) %>%
  select(Location)
```

```
## # A tibble: 6 x 1
##   Location            
##   <chr>               
## 1 36.002743, -78.8785 
## 2 35.612984, -77.39265
## 3 35.595676, -77.59074
## 4 35.076767, -80.7728 
## 5 35.19999, -80.75713 
## 6 35.966644, -78.96749
```

---

## The two `pull`s in your lives

- Don't get `pull` happy when wrangling data! Only extract out variables if you truly need to, otherwise keep in data frame.

- But always ⬇️ `Pull` before starting your work when collaborating on GitHub.

---

## `rename` specific columns

Useful for correcting typos, and renaming to make variable names shorter 
and/or more informative

- Original names:
.small[

```r
names(bike)
```

- Rename `Speed_Limi` to `Speed_Limit`:

```r
bike <- bike %>%
  rename(Speed_Limit = Speed_Limi)
```

---

## Check before you move on

Always check your changes and confirm code did what you wanted it to do

```r
names(bike)
```

```
##  [1] "FID"         "OBJECTID"    "AmbulanceR"  "BikeAge_Gr"  "Bike_Age"   
##  [6] "Bike_Alc_D"  "Bike_Dir"    "Bike_Injur"  "Bike_Pos"    "Bike_Race"  
## [11] "Bike_Sex"    "City"        "County"      "CrashAlcoh"  "CrashDay"   
## [16] "Crash_Date"  "Crash_Grp"   "Crash_Hour"  "Crash_Loc"   "Crash_Mont" 
## [21] "Crash_Time"  "Crash_Type"  "Crash_Ty_1"  "Crash_Year"  "Crsh_Sevri" 
## [26] "Developmen"  "DrvrAge_Gr"  "Drvr_Age"    "Drvr_Alc_D"  "Drvr_EstSp" 
## [31] "Drvr_Injur"  "Drvr_Race"   "Drvr_Sex"    "Drvr_VehTy"  "ExcsSpdInd" 
## [36] "Hit_Run"     "Light_Cond"  "Locality"    "Num_Lanes"   "Num_Units"  
## [41] "Rd_Charact"  "Rd_Class"    "Rd_Conditi"  "Rd_Config"   "Rd_Defects" 
## [46] "Rd_Feature"  "Rd_Surface"  "Region"      "Rural_Urba"  "Speed_Limit"
## [51] "Traff_Cntr"  "Weather"     "Workzone_I"  "Location"
```
]

---

## `summarise` to reduce variables to values

The values are summarised in a data frame

```r
bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())
```

```
## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##    <chr>            <int>
##  1 0-5                 60
##  2 11-15              747
##  3 16-19              605
##  4 20-24              680
##  5 25-29              430
##  6 30-39              658
##  7 40-49              920
##  8 50-59              739
##  9 6-10               421
## 10 60-69              274
## 11 70                  12
## 12 70+                 58
## 13 <NA>               112
```

---

## and `arrange` to order rows

```r
bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(desc(crash_count))
```

```
## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##    <chr>            <int>
##  1 40-49              920
##  2 11-15              747
##  3 50-59              739
##  4 20-24              680
##  5 30-39              658
##  6 16-19              605
##  7 25-29              430
##  8 6-10               421
##  9 60-69              274
## 10 <NA>               112
## 11 0-5                 60
## 12 70+                 58
## 13 70                  12
```

---

## `count` to group by then count

```r
bike %>%
  count(BikeAge_Gr)
```

```
## # A tibble: 13 x 2
##    BikeAge_Gr     n
##    <chr>      <int>
##  1 0-5           60
##  2 11-15        747
##  3 16-19        605
##  4 20-24        680
##  5 25-29        430
##  6 30-39        658
##  7 40-49        920
##  8 50-59        739
##  9 6-10         421
## 10 60-69        274
## 11 70            12
## 12 70+           58
## 13 <NA>         112
```
]

---

## Select rows with `sample_n` or `sample_frac`

- `sample_n`: randomly sample 5 observations

```r
bike_n5 <- bike %>%
  sample_n(5, replace = FALSE)
dim(bike_n5)
```

```
## [1]  5 54
```

- `sample_frac`: randomly sample 20% of observations

```r
bike_perc20 <-bike %>%
  sample_frac(0.2, replace = FALSE)
dim(bike_perc20)
```

```
## [1] 1143   54
```

---

## `distinct` to filter for unique rows

```r
bike %>% 
  select(County, City) %>% 
  distinct() %>% 
  arrange(County, City)
```

```
## # A tibble: 360 x 2
##    County    City              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon College      
##  4 Alamance  Gibsonville       
##  5 Alamance  Graham            
##  6 Alamance  Green Level       
##  7 Alamance  Mebane            
##  8 Alamance  None - Rural Crash
##  9 Alexander None - Rural Crash
## 10 Alleghany None - Rural Crash
## # ... with 350 more rows
```