Data Wrangling

# Data Wrangling
### Dr. Maria Tackett
### 01.28.19

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

- Writing Ex 1 Revision - due **today at 11:59p**

- Think of team name if you haven't already

---

# Data Wrangling

---

## Bike crashes in NC 2007 - 2014

The dataset is in the **dsbox** package:

```r
library(dsbox)
ncbikecrash
```

- The dataset contains all North Carolina bike crash data from 2007-2014. 
- Data downloaded on Sep 6, 2018.

---

### A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

.pull-right[
- `filter`: pick rows matching criteria
- `slice`: pick rows using index(es)
- `select`: pick columns by name
- `pull`: grab a column as a vector
- `arrange`: reorder rows
]

---

### A Grammar of Data Manipulation

**dplyr** is based on the concepts of functions as verbs that manipulate data frames.

.pull-right[
- `mutate`: add new variables
- `distinct`: filter for unique rows
- `sample_n` / `sample_frac`: randomly sample rows
- `summarise`: reduce variables to values
- ... (many more)
]

---

## **dplyr** rules for functions

- First argument is *always* a data frame

- Subsequent arguments say what to do with that data frame

- Always return a data frame

- Don't modify in place

---

### `filter` to select rows

- **One condition**
    - for crashes in Durham County

```r
ncbikecrash %>%
  filter(county == "Durham") 
```

- **Many conditions**
    - for crashes in Durham County where biker was 0-5 years old

```r
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
```

---

### `select` to select or exclude variables

- select variables (columns)

```r
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
```

- exclude variables

```r
ncbikecrash %>%
  select(-object_id)
```

- range of variables

```r
ncbikecrash %>%
  select(city:locality)
```

---

### `slice` for certain row numbers

- First five

```r
ncbikecrash %>%
  slice(1:5)
```

- Last five

```r
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
```

---

### `pull` to extract a column as a vector

```r
ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
```

```
## [1] "Rural (<30% Developed)"       "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)"       "Urban (>70% Developed)"      
## [5] "Urban (>70% Developed)"       "Rural (<30% Developed)"
```
]

vs.

```r
ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
```

```
## # A tibble: 6 x 1
##   locality                    
##   <chr>                       
## 1 Rural (<30% Developed)      
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)      
## 4 Urban (>70% Developed)      
## 5 Urban (>70% Developed)      
## 6 Rural (<30% Developed)
```
]

---

### `sample_n` / `sample_frac` for a random sample

- `sample_n`: randomly sample 5 observations

```r
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 66
```
]

- `sample_frac`: randomly sample 20% of observations

```r
ncbikecrash_perc20 <-ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   66
```
]
---

### `distinct` to filter for unique rows

And `arrange` to order alphabetically

```r
ncbikecrash %>% 
  select(county, city) %>% 
  distinct() %>% 
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon              
##  4 Alamance  Elon College      
##  5 Alamance  Gibsonville       
##  6 Alamance  Graham            
##  7 Alamance  Green Level       
##  8 Alamance  Mebane            
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```

---

### `summarise` to reduce variables to values

```r
ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7
```

---

### `group_by` to do calculations on groups

```r
ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
```

```
## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0
```

---

### `count` observations in groups

```r
ncbikecrash %>%
  count(driver_alcohol_drugs)
```

```
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 Missing                                99
## 2 No                                    695
## 3 Yes-Alcohol,  impairment suspected     12
## 4 Yes-Alcohol, no impairment detected     3
## 5 Yes-Drugs, impairment suspected         4
## 6 <NA>                                 6654
```

---

### `mutate` to add new variables

```r
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
```
]
---

### "Save" when you `mutate`

Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame.

```r
ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
```

---

### Check before you move on

```r
ncbikecrash %>% 
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
```

```
## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 Missing                             <NA>                               99
## 2 No                                  No                                695
## 3 Yes-Alcohol,  impairment suspected  Yes                                12
## 4 Yes-Alcohol, no impairment detected Yes                                 3
## 5 Yes-Drugs, impairment suspected     Yes                                 4
## 6 <NA>                                <NA>                             6654
```
]

```r
ncbikecrash %>% 
  count(driver_alcohol_drugs_simplified)
```

```
## # A tibble: 3 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 No                                695
## 2 Yes                                19
## 3 <NA>                             6753
```
]
---

## AE 04 - NC bike crashes

- Copy the NC Bike Crashes project on RStudio Cloud

- For each question you work on, set the `eval` chunk option to `TRUE` and knit