---
title: "Package dplyr"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "05-21-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options: 
  chunk_output_type: console
---

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```

## Supplementary materials

Companion videos

- [Introduction to `tidyverse`, `dplyr` and the pipe operator](https://warpwire.duke.edu/w/0cgDAA/)
- [Core `dplyr` functions](https://warpwire.duke.edu/w/08gDAA/)
- [More `dplyr`: `summarise()`, `group_by()`, and `case_when()`](https://warpwire.duke.edu/w/1cgDAA/)

Additional resources


- `dplyr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
- [dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)
- [Chapter 5](https://r4ds.had.co.nz/transform.html), R for Data Science

---

## Getting started

```{r}
library(tidyverse)
```

```r
── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
```

Also, load `nycflights13`.

```{r}
library(nycflights13)
```

---

class: inverse, center, middle

# Pipes

---

## Pipes in R

Infix function `%>%` is a forward-pipe operator. It allows you to pipe an object
forward into a function or call expression.

You can think about the following sequence of actions - 
find keys, unlock car, start car, drive to school, park.

<br/>

Expressed as a set of nested functions in R pseudo code this would look like:

```{r eval=FALSE}
park(drive(start_car(unlock_car(find("keys"))), to = "campus"))
```

<br/>

Writing it out using pipes give it a more natural (and easier to read) structure:

```{r eval=FALSE}
find("keys") %>%
  unlock_car() %>% 
  start_car() %>%
  drive(to = "campus") %>%
  park()
```

---

## Approaches

All of the following are fine, it comes down to personal preference.

Nested:

```{r, eval=FALSE}
h(g(f(x), y = 1), z = 1)
```

Piped:

```{r, eval=FALSE}
f(x) %>% 
  g(y = 1) %>% 
  h(z = 1)
```

Intermediate:

```{r, eval=FALSE}
res <- f(x)
res <- g(res, y = 1)
res <- h(res, z = 1)
```

---

## What about other arguments?

By default, the object on the left-hand side of ` %>% ` is placed as the value
to the first argument in the function on the right-hand side of ` %>% `.

To pass the value to other arguments a `.` is used. For example,

--

```{r}
data.frame(a = 1:3, b = 3:1) %>% lm(a ~ b, data = .)
```

--

```{r}
data.frame(a = 1:3, b = 3:1) %>% .[[1]]
```

--

```{r}
data.frame(a = 1:3, b = 3:1) %>% .[[length(.)]]
```

---
class: inverse, center, middle

# Data wrangling with `dplyr`

---

## A grammar of data manipulation

Package `dplyr` is based on the concepts of functions as verbs 
that manipulate data frames.

Single data frame functions / verbs:

.small-text[
| Function                       | Description                          |
|--------------------------------|--------------------------------------|
| `filter()`                     | pick rows matching criteria          |
| `slice()`                      | pick rows using indices              |
| `select()`                     | pick columns by name                 |
| `pull()`                       | grab a column as a vector            |
| `rename()`                     | rename specific columns              |
| `arrange()`                    | reorder rows                         |
| `mutate()`                     | add new variables                    |
| `transmute()`                  | create new data frame with variables |
| `distinct()`                   | filter for unique rows               |
| `sample_n()` / `sample_frac()` | randomly sample rows                 |
| `summarise()`                  | reduce variables to values           |
]

... many more

---

## `dplyr` rules

1. First argument is *always* a data frame

2. Subsequent arguments say what to do with that data frame

3. Almost always returns a data frame

4. Doesn't modify in place

<br/><br/>

Based on rules 1 and 3, it is natural to apply ` %>% ` in a sequence
of `dplyr` functions for data wrangling purposes.

---

## Data

We will demonstrate `dplyr`'s functionality using the `nycflights13` package.

```{r}
flights
```

---

## `filter()` - March flights

```{r}
flights %>% 
  filter(month == 3)
```

---

## `filter()` - flights in the first 7 days of March

```{r}
flights %>% 
  filter(month == 3, day <= 7)
```

---

## `filter()` - flights to LAX *or* RDU in March

```{r}
flights %>%
  filter(dest == "LAX" | dest == "RDU", month == 3)
```

---

## `slice()` - first 10 flights

```{r}
flights %>% 
  slice(1:10)
```

---

## `slice()` - last 5 flights

```{r}
flights %>% 
  slice((n() - 4):n())
```

---

## `select()` - specific variables

```{r}
flights %>% 
  select(year, month, day)
```

---

## `select()` - exclude variables

```{r}
flights %>% 
  select(-year, -month, -day)
```

---

## `select()` - ranges

```{r}
flights %>% 
  select(year:day)
```

---

## `select()` - exclude ranges

```{r}
flights %>% 
  select(-(year:day))
```

---

## `select()` - matching

```{r}
flights %>% 
  select(contains("dep"), contains("arr"))
```

---

```{r}
flights %>% 
  select(starts_with("dep"), starts_with("arr"))
```


Other helpers: `ends_with()`, `matches()`, `num_range()`, `one_of()`, 
`everything()`, `last_col()`

---

## `select_if()` 

```{r}
flights %>% 
  select_if(function(x) !is.numeric(x))
```

---

## `pull()` - grab a vector

```{r}
names(flights)

flights %>% pull("year") %>% head()
```

--

```{r}
flights %>% pull(1) %>% head
```

--

```{r}
flights %>% pull(-1) %>% .[1:4]
```

---

## `arrange()` - sort data

```{r}
flights %>%
  filter(month == 3, day == 2) %>%
  arrange(origin, dest)
```

---

## `arrange()` & `desc()`

By default, sorting is done in ascending order. To change that, use `desc()`.

```{r}
flights %>% 
  filter(month == 3, day == 2) %>% 
  arrange(desc(origin), dest) %>% 
  select(origin, dest, tailnum)
```

---

## `mutate()` - modify columns

```{r}
flights %>%
  select(year:day) %>%
  mutate(date = paste(year, month, day, sep = "/"))
```

---

## `transmute()` - create tibble from existing columns

```{r}
flights %>%
  transmute(date = paste(year, month, day, sep = "/"))
```

---

## `distinct()` - find unique rows

```{r}
flights %>%
  select(origin, dest) %>%
  distinct() %>%
  arrange(origin, dest)
```

---

## Sampling rows

.pull-left[

`sample_n()`

```{r}
flights %>% 
  select(year, origin) %>% 
  sample_n(10)
```

]

.pull-right[

`sample_frac()`

```{r}
flights %>% 
  select(year, origin) %>% 
  sample_frac(.00001)
```

]

---

class: inverse, center, middle

# Exercises

---

## Data: Wake county parcels

Parcel boundaries with address and revenue-related information for properties in 
Wake County, NC. http://data-wake.opendata.arcgis.com/datasets/parcels

```{r}
wake <- read_csv("https://www2.stat.duke.edu/~sms185/data/econ/parcels.csv")
```

---

.tiny[
```{r}
wake <- 
  janitor::clean_names(wake)

wake
```
]

---

## Tasks

1. Which city has the fewest land parcels in the dataset?
<br><br>
   
2. Create a tibble that shows the year a parcel was built and the total
   value, where all parcels are located in Apex and are more than one acre in
   area. Sort the result in ascending order by year built.
<br><br>

3. Choose a subset of five variables and 10 random rows from `wake` and save it 
   as an object named `wake_mini`. Experiment renaming variables with
   `select()` and `rename()` on `wake_mini`. What is the difference between the
   two functions?

???

## Solutions

.solution[
```{r}
wake %>% 
  count(city_decode) %>% 
  arrange(n) %>% 
  slice(1)

wake %>% 
  filter(city_decode == "APEX", calc_area > 1) %>% 
  select(year_built, total_value_assd) %>% 
  arrange(year_built)
```

- Function `select()` only keeps the variables you mention.

- Function `rename()` keeps all variables.
]

---

## `summarise()`

```{r}
flights %>%
  summarize(n(), min(dep_delay), max(dep_delay))
```

--

```{r}
flights %>% 
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm = TRUE), 
    max_dep_delay = max(dep_delay, na.rm = TRUE)
  )
```

---

## Useful `summarise()` functions

- Center: `mean()`, `median()`
  
- Spread: `sd()`, `IQR()`, `mad()`

- Range: `min()`, `max()`, `quantile()`
 
- Position: `first()`, `last()`, `nth()`
 
- Count: `n()`, `n_distinct()`
 
- Logical: `any()`, `all()`

---

## `group_by()`

```{r eval=FALSE}
flights %>% 
  group_by(origin)
```

```{r eval=FALSE}
# A tibble: 336,776 x 19
# Groups:   origin [3] #<<
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
```

---

## `group_by()` then `summarise()`

```{r}
flights %>% 
  group_by(origin) %>%
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm = TRUE), 
    max_dep_delay = max(dep_delay, na.rm = TRUE)
  )
```

---

```{r}
flights %>% 
  group_by(origin, carrier) %>%
  summarize(
    n = n(), 
    min_dep_delay = min(dep_delay, na.rm = TRUE), 
    max_dep_delay = max(dep_delay, na.rm = TRUE)
  ) %>%
  filter(n > 10000)
```

---

## `mutate()` with `group_by()`

```{r}
flights %>% 
  group_by(origin) %>%
  mutate(n = n()) %>%
  select(origin, n)
```

---

## Example

```{r}
flights %>% 
  group_by(origin, month) %>% 
  summarise(n = n()) %>% 
  slice(1)
```

--

<br/>

**Why do I have a tibble with three rows?**

---

## You may need to `ungroup()`

```{r}
flights %>% 
  group_by(origin, month) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  slice(1)
```

---

## `case_when()` - multi-case `if_else()`

Suppose we want to break the parcel size into three categories:
small, medium, large. 

--

.tiny[
```{r}
wake %>% 
  mutate(lot_size = case_when(
    deed_acres < .5   ~ "small",
    deed_acres < 1.5  ~ "medium",
    deed_acres >= 1.5 ~ "large"
  )
  ) %>% 
  select(deed_acres, lot_size)
```
]

---

class: inverse, center, middle

# Exercises

---

## Tasks

Continue to use `wake` for the following tasks.

1. Compute the mean area for each design style.
<br/>

2. Compute the median sale price for each year. *Hint:* `lubridate::year()`
<br/>

3. Which city with at least 1,000 parcels classified as a "Townhouse"
   had the highest proportion of parcels as "Townhouse"?

???

## Solutions

.solution[

```{r}

wake %>% 
  group_by(design_style_decode) %>% 
  summarise(mean_area = mean(calc_area))

wake %>% 
  filter(!is.na(sale_date)) %>% 
  mutate(year = lubridate::year(sale_date)) %>% 
  group_by(year) %>% 
  summarise(median_sale_price = median(totsalprice, na.rm = TRUE))

wake %>% 
  count(city_decode, design_style_decode) %>% 
  group_by(city_decode) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup() %>% 
  filter(design_style_decode == "Townhouse", n >= 1000) %>% 
  arrange(desc(prop)) %>% 
  slice(1)

```

]

---

## References

- https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

- Parcels. (2020). Data-wake.opendata.arcgis.com. Retrieved 9 February 2020, 
  from http://data-wake.opendata.arcgis.com/datasets/parcels