# Tidy data and data wrangling
### Dr. Maria Tackett
### 01.23.19

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Eli Review Signup:

1. Go to [https://app.elireview.com](https://app.elireview.com). 
2. Click the “New to Eli Review? Sign up!” button to create your account. 
    - Be sure to choose the “Student” option. 
    - Consider using your school email, but any address will work.
    
3. Join our course: in the box labeled “Ready to join a course?”, enter this course code: **hyper425bend**

---

## Check in

- Any questions on material from last time?

- Any questions on the lab?

- Any questions on workflow / course structure?

---

# Identifying variables

---

## Number of variables involved

- .vocab[Univariate data analysis]: distribution of a single variable

- .vocab[Bivariate data analysis]: relationship between two variables

- .vocab[Multivariate data analysis]: relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others

---

## Types of variables

- .vocab[Numerical variables] can be classified as .vocab[continuous] or .vocab[discrete], based on whether the variable can take on any of an infinite number of values within a range or only non-negative whole numbers, respectively.
    - *height* is continuous
    - *number of siblings* is discrete

- If the variable is .vocab[categorical], we can determine if it is .vocab[ordinal] based on whether or not the levels have a natural ordering.
    - *hair color* is unordered 
    - *year in school* is ordinal
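
In R, an ordinal variable can be stored as an ordered factor. A minimal sketch, using made-up responses for *year in school* (the vector and the level names are illustrative, not from a course dataset):

```r
# Hypothetical responses for "year in school"
year <- c("Sophomore", "First-year", "Senior", "Junior", "First-year")

# An ordered factor records the natural ordering of the levels
year_ord <- factor(
  year,
  levels  = c("First-year", "Sophomore", "Junior", "Senior"),
  ordered = TRUE
)

levels(year_ord)      # levels in their natural order
year_ord < "Junior"   # comparisons respect that order
sort(year_ord)        # sorting follows the level order, not the alphabet
```

An unordered categorical variable such as *hair color* would use a plain `factor()` without `ordered = TRUE`.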

---

# Visualizing numerical data

---

### Describing shapes of numerical distributions

- **shape:**
    - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    - modality: unimodal, bimodal, multimodal, uniform
    
- **center:** mean (**`mean`**), median (**`median`**), mode (not always useful)

- **spread:** range (**`range`**), standard deviation (**`sd`**), inter-quartile range (**`IQR`**)

- **outliers**: observations outside of the usual pattern
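
The functions named above can be tried on any numeric vector. A small sketch with made-up heights (the values are illustrative only):

```r
# Made-up heights in cm, with one unusually large value
heights <- c(150, 165, 167, 172, 178, 182, 183, 202)

mean(heights)    # center: mean
median(heights)  # center: median
range(heights)   # spread: minimum and maximum
sd(heights)      # spread: standard deviation
IQR(heights)     # spread: inter-quartile range
```

If the vector contains missing values, these functions return `NA` unless `na.rm = TRUE` is supplied.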

---

## Histograms

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

```
## Warning: Removed 6 rows containing non-finite values (stat_bin).
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Density plots

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

```
## Warning: Removed 6 rows containing non-finite values (stat_density).
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Side-by-side box plots

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot()
```

```
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

# Visualizing categorical data

---

## Bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Segmented bar plots, counts

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar()
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

### Segmented bar plots, proportions

```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
```

<img src="2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between gender and hair color? Why?
]

![](2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-7-1.png)

![](2b-tidy-data-wrangle_files/figure-html/unnamed-chunk-8-1.png)


---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
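
A sketch of what these rules mean in practice, assuming the **tidyr** package (version 1.0.0 or later, for `pivot_longer`) is available. The table and its numbers are made up for illustration:

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: the variable "year" is spread across column names
untidy <- tribble(
  ~county,  ~`2013`, ~`2014`,
  "Durham",      40,      45,
  "Wake",        90,      85
)

# Tidy version: each row is one county-year observation,
# and year and crashes are each a single column
tidy <- untidy %>%
  pivot_longer(
    cols      = c(`2013`, `2014`),
    names_to  = "year",
    values_to = "crashes"
  )

tidy
```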

---

## Summary tables

```
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```

```
## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 female              165.
## 2 hermaphrodite       175 
## 3 male                179.
## 4 none                200 
## 5 <NA>                120
```
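
A summary table like the second one is typically computed from the raw data with `group_by` and `summarise`. The code that produced the output above is not shown, so the following is only a sketch of the general pattern, on a small made-up data frame:

```r
library(dplyr)

# Small made-up data frame standing in for the full dataset
characters <- tibble(
  name   = c("A", "B", "C", "D"),
  gender = c("female", "female", "male", "male"),
  height = c(160, 170, 180, NA)
)

# One row per group: a summary table, not a dataset of observations
height_summary <- characters %>%
  group_by(gender) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

height_summary
```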

---

# Pipes

---

## Where does the name come from?

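The pipe operator, `%>%`, is implemented in the package **magrittr**. It's pronounced "and then". The package name is a nod to René Magritte's painting *The Treachery of Images* ("Ceci n'est pas une pipe").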

.small[
[https://en.wikipedia.org/wiki/The_Treachery_of_Images](https://en.wikipedia.org/wiki/The_Treachery_of_Images)
]

---

## Review: How does a pipe work?

- Think about the following sequence of actions: find keys, start car, drive to campus, park.
- Expressed as a set of nested functions in R pseudocode, this would look like:

```r
park(drive(start_car(find("keys")), to = "campus"))
```

- Writing it out using pipes gives it a more natural (and easier to read) structure:

```r
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
```
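
The pseudocode above does not run, since `find`, `start_car`, `drive`, and `park` are not real functions. A runnable sketch of the same idea with base R functions (the vector `x` is made up):

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(sqrt(sum(x)), 1)

# Piped: read top to bottom, "and then"
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)
```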

---

## What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use "`.`"

```r
starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11
```
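
A smaller sketch of the same placeholder behavior, using `paste` so the effect of the dot is easy to see (the strings are made up):

```r
library(magrittr)

# Without ".", the left-hand side becomes the first argument
first_arg <- "campus" %>% paste("drive to")

# With ".", the left-hand side goes where the dot is
dot_arg <- "campus" %>% paste("drive to", .)

first_arg  # "campus drive to"
dot_arg    # "drive to campus"
```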

---

# Data wrangling

---

## Bike crashes in NC 2007 - 2014

The dataset is in the **dsbox** package:

```r
library(dsbox)
ncbikecrash
```

- The dataset contains all North Carolina bike crash data from 2007-2014. 
- Data downloaded on Sep 6, 2018.

---

## Variables

View the names of variables via

```r
names(ncbikecrash)
```

```
##  [1] "object_id"            "city"                 "county"              
##  [4] "region"               "development"          "locality"            
##  [7] "on_road"              "rural_urban"          "speed_limit"         
## [10] "traffic_control"      "weather"              "workzone"            
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"        
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"         
## [19] "bike_position"        "bike_race"            "bike_sex"            
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"      
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"       
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type" 
## [31] "crash_alcohol"        "crash_date"           "crash_day"           
## [34] "crash_group"          "crash_hour"           "crash_location"      
## [37] "crash_month"          "crash_severity"       "crash_time"          
## [40] "crash_type"           "crash_year"           "ambulance_req"       
## [43] "hit_run"              "light_condition"      "road_character"      
## [46] "road_class"           "road_condition"       "road_configuration"  
## [49] "road_defects"         "road_feature"         "road_surface"        
## [52] "num_bikes_ai"         "num_bikes_bi"         "num_bikes_ci"        
## [55] "num_bikes_ki"         "num_bikes_no"         "num_bikes_to"        
## [58] "num_bikes_ui"         "num_lanes"            "num_units"           
## [61] "distance_mi_from"     "frm_road"             "rte_invd_cd"         
## [64] "towrd_road"           "geo_point"            "geo_shape"
```

and see detailed descriptions with `?ncbikecrash`.

---

## Viewing your data

- In the Environment pane, after loading the data with `data(ncbikecrash)`, click on the name of the data frame to view it in the data viewer
- Use the `glimpse` function to take a peek

```r
glimpse(ncbikecrash)
```

```
## Observations: 7,467
## Variables: 66
## $ object_id            <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city                 <chr> "None - Rural Crash", "Henderson", "None - …
## $ county               <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region               <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development          <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality             <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road              <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban          <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit          <chr> "50 - 55  MPH", "30 - 35  MPH", "50 - 55  M…
## $ traffic_control      <chr> "No Control Present", "Stop Sign", "Double …
## $ weather              <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age             <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group       <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol         <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction       <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury          <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position        <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race            <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex             <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age           <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group     <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol       <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed     <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury        <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race          <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex           <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type  <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol        <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date           <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day            <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group          <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour           <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location       <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month          <chr> "December", "November", "November", "Decemb…
## $ crash_severity       <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time           <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type           <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req        <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run              <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition      <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character       <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class           <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition       <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration   <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects         <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature         <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface         <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units            <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape            <chr> "{\"type\": \"Point\", \"coordinates\": [-7…
```

---

### A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

.pull-right[
.tiny[
- `filter`: pick rows matching criteria
- `slice`: pick rows using index(es)
- `select`: pick columns by name
- `pull`: grab a column as a vector
- `arrange`: reorder rows
]
]

---

### A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

.pull-right[
.tiny[
- `mutate`: add new variables
- `distinct`: filter for unique rows
- `sample_n` / `sample_frac`: randomly sample rows
- `summarise`: reduce variables to values
- ... (many more)
]
]

---

## **dplyr** rules for functions

- First argument is *always* a data frame

- Subsequent arguments say what to do with that data frame

- Always return a data frame

- Don't modify in place
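
A quick way to see the last two rules in action, sketched on a small made-up data frame (`crashes` is hypothetical, not the course dataset):

```r
library(dplyr)

crashes <- tibble(
  county = c("Durham", "Wake", "Durham"),
  hour   = c(8, 17, 22)
)

# filter() returns a new data frame ...
durham <- crashes %>% filter(county == "Durham")

nrow(durham)   # rows that matched the condition
nrow(crashes)  # ... and the original is left unchanged
```

To keep a result, assign it to a name as done with `durham` here.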

---

## A note on piping and layering

- The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
<br>

- The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`.
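
The two operators often appear in the same chunk: pipes prepare the data, then layers build the plot. A minimal sketch on made-up data (the data frame and the bin width are illustrative choices):

```r
library(dplyr)
library(ggplot2)

crashes <- tibble(
  county = c("Durham", "Wake", "Durham", "Orange"),
  hour   = c(8, 17, 22, 13)
)

p <- crashes %>%                       # pipe: data flows into the next function
  filter(county != "Orange") %>%
  ggplot(mapping = aes(x = hour)) +    # from here on, layers are added with +
  geom_histogram(binwidth = 6)

p
```

Using `%>%` between ggplot2 layers, or `+` between dplyr verbs, results in an error.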

---

### `filter` to select a subset of rows

for crashes in Durham County

```r
ncbikecrash %>%
  filter(county == "Durham")
```

```
## # A tibble: 340 x 66
##    object_id city  county region development locality on_road rural_urban
##        <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban      
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban      
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban      
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban      
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <int>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <time>, crash_type <chr>, crash_year <int>,
## #   ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## #   road_character <chr>, road_class <chr>, road_condition <chr>,
## #   road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## #   road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## #   num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## #   num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## #   num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## #   rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>
```

---

### `filter` for many conditions at once

for crashes in Durham County where the biker was 0-5 years old

```r
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
```

```
## # A tibble: 4 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban      
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
```

---

### Logical operators in R

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        | less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        | greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        | exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        | not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |
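
A few of these operators inside `filter`, sketched on a small made-up data frame (`crashes` and its values are hypothetical):

```r
library(dplyr)

crashes <- tibble(
  county = c("Durham", "Wake", "Orange", "Durham", NA),
  hour   = c(8, 17, 22, 13, 9)
)

# OR: crashes in Durham or Wake
or_rows <- crashes %>% filter(county == "Durham" | county == "Wake")

# %in%: the same condition, written more compactly
in_rows <- crashes %>% filter(county %in% c("Durham", "Wake"))

# AND with a comparison: Durham crashes after noon
and_rows <- crashes %>% filter(county == "Durham" & hour > 12)

# Rows where county is not missing
not_missing <- crashes %>% filter(!is.na(county))
```

Note that `filter` drops rows where the condition evaluates to `NA`, so the row with a missing county appears in none of the first three results.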

---

### `select` to keep variables

```r
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
```

```
## # A tibble: 4 x 2
##   locality               speed_limit 
##   <chr>                  <chr>       
## 1 Urban (>70% Developed) 30 - 35  MPH
## 2 Urban (>70% Developed) 5 - 15 MPH  
## 3 Urban (>70% Developed) 20 - 25  MPH
## 4 Urban (>70% Developed) 20 - 25  MPH
```

---

### `select` to exclude variables

```r
ncbikecrash %>%
  select(-object_id)
```

```
## # A tibble: 7,467 x 65
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>      
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55  M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35  M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55  M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35  M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>       
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55  M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35  M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35  M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban       30 - 35  M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban       20 - 25  M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
```

---

### `select` a range of variables

```r
ncbikecrash %>%
  select(city:locality)
```

```
## # A tibble: 7,467 x 5
##    city           county     region  development       locality            
##    <chr>          <chr>      <chr>   <chr>             <chr>               
##  1 None - Rural … Wayne      Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  2 Henderson      Vance      Piedmo… Residential       Mixed (30% To 70% D…
##  3 None - Rural … Lincoln    Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
##  4 Whiteville     Columbus   Coastal Commercial        Urban (>70% Develop…
##  5 Wilmington     New Hanov… Coastal Residential       Urban (>70% Develop…
##  6 None - Rural … Robeson    Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  7 None - Rural … Richmond   Piedmo… Residential       Mixed (30% To 70% D…
##  8 Raleigh        Wake       Piedmo… Commercial        Urban (>70% Develop…
##  9 Whiteville     Columbus   Coastal Residential       Rural (<30% Develop…
## 10 New Bern       Craven     Coastal Residential       Urban (>70% Develop…
## # … with 7,457 more rows
```

---

### `slice` for certain row numbers

First five

```r
ncbikecrash %>%
  slice(1:5)
```

```
## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      1686 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural      
## 2      1674 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban      
## 3      1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural      
## 4      1687 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban      
## 5      1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
```

---

### `slice` for certain row numbers

Last five

```r
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
```

```
## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      6989 High… Guilf… Piedm… Residential Urban (… <NA>    Urban      
## 2      6991 Wilm… New H… Coast… Residential Urban (… <NA>    Urban      
## 3      6995 Kins… Lenoir Coast… Commercial  Urban (… <NA>    Urban      
## 4      6998 Faye… Cumbe… Coast… Residential Urban (… <NA>    Urban      
## 5      7000 None… Onslow Coast… Farms, Woo… Rural (… <NA>    Rural      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
```
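
An alternative sketch that avoids the intermediate `last_row` variable: dplyr's `n()` gives the number of rows and can be used inside `slice`. Shown here on a made-up ten-row data frame:

```r
library(dplyr)

df <- tibble(id = 1:10)

# n() is the number of rows, so this picks the last five
last_five <- df %>%
  slice((n() - 4):n())

last_five$id
```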

---

### `pull` to extract a column as a vector

```r
ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
```

```
## [1] "Rural (<30% Developed)"       "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)"       "Urban (>70% Developed)"      
## [5] "Urban (>70% Developed)"       "Rural (<30% Developed)"
```

vs.

```r
ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
```

```
## # A tibble: 6 x 1
##   locality                    
##   <chr>                       
## 1 Rural (<30% Developed)      
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)      
## 4 Urban (>70% Developed)      
## 5 Urban (>70% Developed)      
## 6 Rural (<30% Developed)
```

---

### `sample_n` / `sample_frac` for a random sample

- `sample_n`: randomly sample 5 observations

```r
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 66
```

- `sample_frac`: randomly sample 20% of observations

```r
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   66
```
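
Because the sampling is random, the rows selected change from run to run. One common way to make a sample reproducible is to set the random seed first. A sketch on a made-up data frame (the seed value `123` is arbitrary):

```r
library(dplyr)

df <- tibble(id = 1:100)

set.seed(123)                    # any fixed integer works
sample_a <- df %>% sample_n(5)

set.seed(123)                    # same seed, same sample
sample_b <- df %>% sample_n(5)

identical(sample_a, sample_b)
```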

---

### `distinct` to filter for unique rows

And `arrange` to order alphabetically

```r
ncbikecrash %>% 
  select(county, city) %>% 
  distinct() %>% 
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon              
##  4 Alamance  Elon College      
##  5 Alamance  Gibsonville       
##  6 Alamance  Graham            
##  7 Alamance  Green Level       
##  8 Alamance  Mebane            
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```

---

### `summarise` to reduce variables to values

```r
ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7
```
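
`crash_hour` has no missing values, so `mean` works directly here. For a column that does contain `NA`s, the summary would be `NA` unless `na.rm = TRUE` is supplied. A sketch on made-up data:

```r
library(dplyr)

crashes <- tibble(hour = c(8, 17, NA, 13))

hr_summary <- crashes %>%
  summarise(
    avg_hr_raw = mean(hour),                # NA, because one value is missing
    avg_hr     = mean(hour, na.rm = TRUE),  # mean of the observed values
    n_missing  = sum(is.na(hour))           # how many values were dropped
  )

hr_summary
```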

---

### `group_by` to do calculations on groups

```r
ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0
```

---

### `count` observations in groups

```r
ncbikecrash %>%
  count(driver_alcohol_drugs)
```

```
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 Missing                                99
## 2 No                                    695
## 3 Yes-Alcohol,  impairment suspected     12
## 4 Yes-Alcohol, no impairment detected     3
## 5 Yes-Drugs, impairment suspected         4
## 6 <NA>                                 6654
```
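
`count` is a shortcut for grouping and then summarising with `n()`. A sketch showing both forms on a made-up column:

```r
library(dplyr)

crashes <- tibble(hit_run = c("No", "Yes", "No", "No", NA))

# Shortcut
by_count <- crashes %>%
  count(hit_run)

# Longer form with the same counts
by_group <- crashes %>%
  group_by(hit_run) %>%
  summarise(n = n())

identical(by_count$n, by_group$n)
```

As in the output above, missing values get their own `<NA>` row.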

---

### `mutate` to add new variables

```r
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    # NA_character_ keeps every branch of case_when() the same type
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ "No"
  ))
```
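
`case_when` checks its conditions in order and uses the first one that is `TRUE`. A sketch on a short made-up vector (assuming **stringr** is loaded for `str_detect`) shows how each kind of value is handled:

```r
library(dplyr)
library(stringr)

# Made-up values covering each case, including a true missing value
status <- c("Missing", "No", "Yes-Drugs, impairment suspected", NA)

simplified <- case_when(
  status == "Missing"       ~ NA_character_,  # checked first
  str_detect(status, "Yes") ~ "Yes",          # checked second
  TRUE                      ~ "No"            # everything else, including NA
)

simplified
```

The final `TRUE` branch also catches values that were already `NA`, since the earlier conditions evaluate to `NA` rather than `TRUE` for them. Whether that is what you want depends on the analysis.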
---

### "Save" when you `mutate`

Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame.

```r
ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
```

---

### Check before you move on

```r
ncbikecrash %>% 
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
```

```
## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 Missing                             Missing                            99
## 2 No                                  No                                695
## 3 Yes-Alcohol,  impairment suspected  Yes                                12
## 4 Yes-Alcohol, no impairment detected Yes                                 3
## 5 Yes-Drugs, impairment suspected     Yes                                 4
## 6 <NA>                                <NA>                             6654
```

```r
ncbikecrash %>% 
  count(driver_alcohol_drugs_simplified)
```

```
## # A tibble: 4 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 Missing                            99
## 2 No                                695
## 3 Yes                                19
## 4 <NA>                             6654
```

---

## AE 04 - NC bike crashes

- Copy the NC Bike Crashes project on RStudio Cloud

- For each question you work on, set the `eval` chunk option to `TRUE` and knit

---

## Before next class

- You will get your teams in lab tomorrow!