---
title: Bigish data
author: "Colin Rundel"
date: "2018-11-05"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)

knitr::opts_chunk$set(
  fig.align = "center"
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())

library(dplyr)
```

---
class: middle
count: false

# Background

---

## Big vs Bigish data

* We will be working with data that is large, but will still fit in memory. 

* R *loves* to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems. 
    
    * Less of an issue on saxon (256 GB of Ram), but this is a shared resource - use too much and your session will be killed.

* In general you should prefer:

.center[
<br/>
*subsetting/vectorization >> apply > loops*

<br/>
*built-in/base > user C/C++ functions > user R functions*
]

---
class: middle
count: false

# Reading Data

---

## Reading parking data

If we use the basic approach of `read.csv`, we end up waiting a really long time,

```{r, eval=FALSE}
system.time(read.csv("/data/nyc_parking/nyc_parking_2014.csv"))

##    user  system elapsed  
## 377.359   7.080 384.411 
```

Over 6 minutes to read in a 1.7 gigabyte CSV file.

---

## Improvements 

If we use `stringsAsFactors` and `comment.char` arguments we can speed things up a bit.

```{r, eval=FALSE}
system.time(
  read.csv(
    "/data/nyc_parking/nyc_parking_2014.csv", 
    stringsAsFactors=FALSE, 
    comment.char=""
  )
)

##    user  system elapsed 
## 281.399   4.615 285.975 
```

We can take this farther by specifying the structure of the data using the `colClasses` argument.

---

## Alternatives - data.table

.small[
```{r, eval=FALSE}
system.time({
  nyc_fread = data.table::fread("/data/nyc_parking/nyc_parking_2014.csv")
})

## |--------------------------------------------------|
## |==================================================|
##    user  system elapsed 
##  44.636   2.393  24.606 

class(nyc_fread)

## "data.table" "data.frame"

nyc = as.data.frame(nyc_fread)
class(nyc)

## [1] "data.frame"
```
]

---

## Alternatives - readr 

.small[
```{r, eval=FALSE}
system.time({
  nyc = readr::read_csv("/data/nyc_parking/nyc_parking_2014.csv")
})
  
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_integer(),
##   `Street Code1` = col_integer(),
##   `Street Code2` = col_integer(),
##   `Street Code3` = col_integer(),
##   `Vehicle Expiration Date` = col_integer(),
##   `Violation Precinct` = col_integer(),
##   `Issuer Precinct` = col_integer(),
##   `Issuer Code` = col_integer(),
##   `Date First Observed` = col_integer(),
##   `Law Section` = col_integer(),
##   `Violation Legal Code` = col_integer(),
##   `Unregistered Vehicle?` = col_integer(),
##   `Vehicle Year` = col_integer(),
##   `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## |========================================| 100% 1713 MB

##   user  system elapsed 
## 97.474  10.719 129.293 
``` 
]

---

## Problems

.small[
```{r, eval=FALSE}
readr::problems(nyc)

## # A tibble: 654,437 x 5
##      row col                  expected   actual file                                       
##    <int> <chr>                <chr>      <chr>  <chr>                                      
##  1  2647 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  2  3792 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  3  4001 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  4  4002 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  5  4003 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  6  4004 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  7  4005 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  8  4006 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
##  9  4019 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
## 10  4059 Violation Legal Code an integer T      '/data/nyc_parking/nyc_parking_2014.csv'
## # ... with 654,427 more rows
```
]

---

## readr 

This is a recent package that is designed to be a fast and friendly way of reading tabular data into R.

<br/>

Core features:

* Faster than base R (~3-4x)

* No strings as factors

* No column name mangling

* Consistent argument/function naming scheme

* Plays nice with dplyr (`tbl_df`)

* Progress bars

---

.small[
```{r, eval=FALSE}
nyc

## Source: local data frame [9,100,278 x 43]
## 
##    Summons Number Plate ID Registration State Plate Type issue_date Violation Code
##             (dbl)    (chr)              (chr)      (chr)      (chr)          (int)
## 1      1361929741  FCJ5493                 NY        PAS 12/18/1970             20
## 2      1366962000  63540MC                 NY        COM 02/02/1971             46
## 3      1356906515  GFM1421                 NY        PAS 09/18/1971             40
## 4      1342296217  FYM5117                 NY        SRF 09/18/1971             21
## 5      1342296199  95V6675                 TX        PAS 09/18/1971             21
## 6      1342296187  GCY4187                 NY        SRF 09/18/1971             21
## 7      1337077380  18972BB                 NY        999 10/10/1971             14
## 8      1364523796  WNJ4730                 VA        PAS 04/05/1973             14
## 9      1359914924  68091JZ                 NY        COM 07/22/1973             46
## 10     1355498326  EWV4127                 NY        PAS 08/12/1973             21
## ..            ...      ...                ...        ...        ...            ...
## Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr),
##   Street Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date
##   (int), Violation Location (chr), Violation Precinct (int), Issuer Precinct (int),
##   Issuer Code (int), Issuer Command (chr), Issuer Squad (chr), Violation Time (chr), Time
##   First Observed (chr), Violation County (chr), Violation In Front Of Or Opposite (chr),
##   House Number (chr), Street Name (chr), Intersecting Street (chr), Date First Observed
##   (int), Law Section (int), Sub Division (chr), Violation Legal Code (int), Days Parking
##   In Effect (chr), From Hours In Effect (chr), To Hours In Effect (chr), Vehicle Color
##   (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter Number (chr), Feet From
##   Curb (int), Violation Post Code (chr), Violation Description (chr), No Standing or
##   Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)
```
]

---

## Fixing column names

.small[
```{r eval=FALSE}
(nyc = janitor::clean_names(nyc))

## # A tibble: 9,100,278 x 43
##    summons_number plate_id registration_st… plate_type issue_date violation_code
##             <dbl> <chr>    <chr>            <chr>      <chr>               <int>
##  1     1361929741 FCJ5493  NY               PAS        12/18/1970             20
##  2     1366962000 63540MC  NY               COM        02/02/1971             46
##  3     1356906515 GFM1421  NY               PAS        09/18/1971             40
##  4     1342296217 FYM5117  NY               SRF        09/18/1971             21
##  5     1342296199 95V6675  TX               PAS        09/18/1971             21
##  6     1342296187 GCY4187  NY               SRF        09/18/1971             21
##  7     1337077380 18972BB  NY               999        10/10/1971             14
##  8     1364523796 WNJ4730  VA               PAS        04/05/1973             14
##  9     1359914924 68091JZ  NY               COM        07/22/1973             46
## 10     1355498326 EWV4127  NY               PAS        08/12/1973             21
## # ... with 9,100,268 more rows, and 37 more variables: vehicle_body_type <chr>,
## #   vehicle_make <chr>, issuing_agency <chr>, street_code1 <int>, street_code2 <int>,
## #   street_code3 <int>, vehicle_expiration_date <int>, violation_location <chr>,
## #   violation_precinct <int>, issuer_precinct <int>, issuer_code <int>,
## #   issuer_command <chr>, issuer_squad <chr>, violation_time <chr>,
## #   time_first_observed <chr>, violation_county <chr>,
## #   violation_in_front_of_or_opposite <chr>, house_number <chr>, street_name <chr>,
## #   intersecting_street <chr>, date_first_observed <int>, law_section <int>,
## #   sub_division <chr>, violation_legal_code <int>, days_parking_in_effect <chr>,
## #   from_hours_in_effect <chr>, to_hours_in_effect <chr>, vehicle_color <chr>,
## #   unregistered_vehicle <int>, vehicle_year <int>, meter_number <chr>,
## #   feet_from_curb <int>, violation_post_code <chr>, violation_description <chr>,
## #   no_standing_or_stopping_violation <chr>, hydrant_violation <chr>,
## #   double_parking_violation <chr>
```
]

---

## Simplifying 

There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns.

.small[
```{r eval=FALSE}
(nyc_trim = nyc %>% 
  select(registration_state:issuing_agency, 
         violation_location, violation_precinct, violation_time,
         house_number:intersecting_street, vehicle_color))

## # A tibble: 9,100,278 x 14
##    registration_st… plate_type issue_date violation_code vehicle_body_ty… vehicle_make
##    <chr>            <chr>      <chr>               <int> <chr>            <chr>       
##  1 NY               PAS        12/18/1970             20 SUBN             GMC         
##  2 NY               COM        02/02/1971             46 DELV             FRUEH       
##  3 NY               PAS        09/18/1971             40 SDN              MAZDA       
##  4 NY               SRF        09/18/1971             21 SUBN             NISSA       
##  5 TX               PAS        09/18/1971             21 NA               GMC         
##  6 NY               SRF        09/18/1971             21 VAN              FORD        
##  7 NY               999        10/10/1971             14 BUS              INTER       
##  8 VA               PAS        04/05/1973             14 SDN              TOYOT       
##  9 NY               COM        07/22/1973             46 DELV             TOYOT       
## 10 NY               PAS        08/12/1973             21 SUBN             ACURA       
## # ... with 9,100,268 more rows, and 8 more variables: issuing_agency <chr>,
## #   violation_location <chr>, violation_precinct <int>, violation_time <chr>,
## #   house_number <chr>, street_name <chr>, intersecting_street <chr>,
## #   vehicle_color <chr>
```
]

---

## Object Sizes

```shell
cr173@gort [nyc_parking]$ ls -lah
total 1.7G
drwxr-xr-x 4 cr173 visitor 4.0K Nov  5 11:55 .
drwxrwxrwx 3 root  root    4.0K Nov  5 12:47 ..
-rwxr--r-- 1 cr173 visitor  14K Nov  5 11:53 fine_definition.csv
drwxr-xr-x 2 cr173 visitor 4.0K Nov  5 11:53 nybb
-rwxr--r-- 1 cr173 visitor 1.7G Nov  5 11:53 nyc_parking_2014.csv
drwxrwxr-x 2 cr173 visitor 4.0K Nov  5 11:53 pluto_manhattan
```

```{r eval=FALSE}
pryr::object_size(nyc)
## 2.83 GB

pryr::object_size(nyc_fread)
## 2.69 GB

pryr::object_size(nyc_trim)
## 998 MB
```

---

## Clean data?

How many different car colors are in this data set?

--

.pull-left[ .small[
```{r eval=FALSE}
nyc %>% 
  count(vehicle_color) %>% 
  arrange(desc(n))

## # A tibble: 2,891 x 2
##    vehicle_color       n
##    <chr>           <int>
##  1 WHITE         1348510
##  2 GY            1214213
##  3 WH            1192609
##  4 BK             941007
##  5 BLACK          665194
##  6 BL             442368
##  7 GREY           417142
##  8 SILVE          313770
##  9 BLUE           301119
## 10 RD             272772
## # ... with 2,881 more rows
```
] ]

.pull-right[ .small[
```{r eval=FALSE}
nyc %>% 
  count(vehicle_color)


## # A tibble: 2,891 x 2
##    vehicle_color     n
##    <chr>         <int>
##  1 -                 5
##  2 --                2
##  3 ---               1
##  4 ----              1
##  5 -----             1
##  6 -.                1
##  7 ,                 1
##  8 ,.A               1
##  9 ,.J.,             1
## 10 ,SILV             2
## # ... with 2,881 more rows
```
] ]

---

## Fixing Dates 

```{r, eval=FALSE}
library(lubridate)
class(nyc$issue_date)

## [1] "character"

nyc = nyc %>% mutate(issue_date = mdy(issue_date, tz="America/New_York"))
class(nyc$issue_date)

## [1] "POSIXct" "POSIXt"

head(nyc$issue_date)

## [1] "1970-12-18 EST" "1971-02-02 EST" "1971-09-18 EDT" "1971-09-18 EDT" "1971-09-18 EDT" "1971-09-18 EDT"
```

---

.small[
```{r, eval=FALSE}
nyc

## # A tibble: 9,100,278 x 43
##    summons_number plate_id registration_st… plate_type issue_date         
##             <dbl> <chr>    <chr>            <chr>      <dttm>             
##  1     1361929741 FCJ5493  NY               PAS        1970-12-18 00:00:00
##  2     1366962000 63540MC  NY               COM        1971-02-02 00:00:00
##  3     1356906515 GFM1421  NY               PAS        1971-09-18 00:00:00
##  4     1342296217 FYM5117  NY               SRF        1971-09-18 00:00:00
##  5     1342296199 95V6675  TX               PAS        1971-09-18 00:00:00
##  6     1342296187 GCY4187  NY               SRF        1971-09-18 00:00:00
##  7     1337077380 18972BB  NY               999        1971-10-10 00:00:00
##  8     1364523796 WNJ4730  VA               PAS        1973-04-05 00:00:00
##  9     1359914924 68091JZ  NY               COM        1973-07-22 00:00:00
## 10     1355498326 EWV4127  NY               PAS        1973-08-12 00:00:00
## # ... with 9,100,268 more rows, and 38 more variables: violation_code <int>,
## #   vehicle_body_type <chr>, vehicle_make <chr>, issuing_agency <chr>,
## #   street_code1 <int>, street_code2 <int>, street_code3 <int>,
## #   vehicle_expiration_date <int>, violation_location <chr>,
## #   violation_precinct <int>, issuer_precinct <int>, issuer_code <int>,
## #   issuer_command <chr>, issuer_squad <chr>, violation_time <chr>,
## #   time_first_observed <chr>, violation_county <chr>,
## #   violation_in_front_of_or_opposite <chr>, house_number <chr>,
## #   street_name <chr>, intersecting_street <chr>, date_first_observed <int>,
## #   law_section <int>, sub_division <chr>, violation_legal_code <int>,
## #   days_parking_in_effect <chr>, from_hours_in_effect <chr>,
## #   to_hours_in_effect <chr>, vehicle_color <chr>, unregistered_vehicle <int>,
## #   vehicle_year <int>, meter_number <chr>, feet_from_curb <int>,
## #   violation_post_code <chr>, violation_description <chr>,
## #   no_standing_or_stopping_violation <chr>, hydrant_violation <chr>,
## #   double_parking_violation <chr>
```
]

---

## More fixing dates 

.small[
```{r, eval=FALSE}
range(nyc$issue_date)

## [1] "1970-12-18 EST" "2069-12-23 EST"

nyc$issue_date %>% year() %>% table()

##   1970    1971    1973    1974    1976    1977    1979    1981    1983    1984    1987 
##      1       6      10       1       2       1       2       4       1       2       3 
##   1990    1991    1996    2000    2001    2002    2003    2004    2005    2006    2007 
##      2       1       1     319      91       7      39      77       9      11      13 
##   2008    2009    2010    2011    2012    2013    2014    2015    2016    2017    2018 
##      8       9     129     251     618 4379109 4716512    1522     296     309     181 
##   2019    2020    2021    2022    2023    2024    2025    2026    2027    2028    2029 
##    329      18      26       1      31      23      10       4       4       7       3 
##   2030    2031    2032    2033    2040    2041    2043    2044    2045    2046    2047 
##     45      93       3       8       1      39       9       9       2       7       6 
##   2048    2049    2050    2051    2052    2053    2060    2061    2063    2064    2066 
##      1       3       1      12       2       1       3      10       9       5       3 
##   2067    2069 
##      2       1 
```
]

---

.small[
```{r, eval=FALSE}
filter(nyc, issue_date >= mdy("1/1/2013"), issue_date <= mdy("12/31/2014"))

## # A tibble: 9,095,621 x 43
##    Summons_Number Plate_ID registration_state Plate_Type issue_date Violation_Code Vehicle_Body_Type
##             <dbl>    <chr>              <chr>      <chr>     <dttm>          <int>             <chr>
##  1     1354300671   S28CMN                 NJ        PAS 2013-01-01             20              SUBN
##  2     1349345910   XTX057                 MI        PAS 2013-01-01             99               SDN
##  3     1268869855  GJK5565                 NY        PAS 2013-01-01             20              SUBN
##  4     1268869843  EPS8803                 NY        PAS 2013-01-01             20               SDN
##  5     1365149122  FWZ5341                 NY        SRF 2013-01-01             20               SDN
##  6     1364348044  EPE8859                 NY        PAS 2013-01-01             46              SUBN
##  7     1364348032  ERT3706                 NY        PAS 2013-01-01             62               SDN
##  8     1364838760  DPA3951                 NY        PAS 2013-01-01             98               SDN
##  9     1364832835  ETS1289                 NY        PAS 2013-01-01             71               SDN
## 10     1364805819  FGE4351                 NY        PAS 2013-01-01             71              SUBN
## # ... with 9,095,611 more rows, and 36 more variables: Vehicle_Make <chr>, issuing_agency <chr>,
## #   Street_Code1 <int>, Street_Code2 <int>, Street_Code3 <int>, Vehicle_Expiration_Date <int>,
## #   violation_location <chr>, violation_precinct <int>, Issuer_Precinct <int>, Issuer_Code <int>,
## #   Issuer_Command <chr>, Issuer_Squad <chr>, violation_time <chr>, Time_First_Observed <chr>,
## #   Violation_County <chr>, Violation_In_Front_Of_Or_Opposite <chr>, house_number <chr>, street_name <chr>,
## #   intersecting_street <chr>, Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## #   Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>, From_Hours_In_Effect <chr>,
## #   To_Hours_In_Effect <chr>, vehicle_color <chr>, `Unregistered_Vehicle?` <int>, Vehicle_Year <int>,
## #   Meter_Number <chr>, Feet_From_Curb <int>, Violation_Post_Code <chr>, Violation_Description <chr>,
## #   No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>, Double_Parking_Violation <chr>
```
]

---

## Performance?

```{r, eval=FALSE}
system.time(filter(nyc, issue_date >= mdy("1/1/2013"), issue_date <= mdy("12/31/2014")))
  
##   user  system elapsed  
##  4.831   1.566   6.427  

system.time(filter(nyc, year(issue_date) %in% c(2013,2014)))
  
##   user  system elapsed   
##  6.864   1.952   8.879   
```

---

## Putting it all together

.small[
```{r, eval=FALSE}
nyc = readr::read_csv("/data/nyc_parking/nyc_parking_2014.csv") %>%
  janitor::clean_names() %>%
  select(registration_state:issuing_agency, 
       violation_location, violation_precinct, violation_time,
       number:intersecting_street, vehicle_color) %>%
  mutate(issue_date = mdy(issue_date)) %>% 
  mutate(issue_day = day(issue_date),
         issue_month = month(issue_date),
         issue_year = year(issue_date),
         issue_wday = wday(issue_date, label=TRUE)) %>%
  filter(issue_year %in% 2013:2014)

nyc

## # A tibble: 9,095,621 x 47
##    summons_number plate_id registration_st… plate_type issue_date          violation_code
##             <dbl> <chr>    <chr>            <chr>      <dttm>                       <int>
##  1     1354300671 S28CMN   NJ               PAS        2013-01-01 00:00:00             20
##  2     1349345910 XTX057   MI               PAS        2013-01-01 00:00:00             99
##  3     1268869855 GJK5565  NY               PAS        2013-01-01 00:00:00             20
##  4     1268869843 EPS8803  NY               PAS        2013-01-01 00:00:00             20
##  5     1365149122 FWZ5341  NY               SRF        2013-01-01 00:00:00             20
##  6     1364348044 EPE8859  NY               PAS        2013-01-01 00:00:00             46
##  7     1364348032 ERT3706  NY               PAS        2013-01-01 00:00:00             62
##  8     1364838760 DPA3951  NY               PAS        2013-01-01 00:00:00             98
##  9     1364832835 ETS1289  NY               PAS        2013-01-01 00:00:00             71
## 10     1364805819 FGE4351  NY               PAS        2013-01-01 00:00:00             71
## # ... with 9,095,611 more rows, and 41 more variables: vehicle_body_type <chr>, vehicle_make <chr>,
## #   issuing_agency <chr>, street_code1 <int>, street_code2 <int>, street_code3 <int>,
## #   vehicle_expiration_date <int>, violation_location <chr>, violation_precinct <int>,
## #   issuer_precinct <int>, issuer_code <int>, issuer_command <chr>, issuer_squad <chr>,
## #   violation_time <chr>, time_first_observed <chr>, violation_county <chr>,
## #   violation_in_front_of_or_opposite <chr>, house_number <chr>, street_name <chr>,
## #   intersecting_street <chr>, date_first_observed <int>, law_section <int>, sub_division <chr>,
## #   violation_legal_code <int>, days_parking_in_effect <chr>, from_hours_in_effect <chr>,
## #   to_hours_in_effect <chr>, vehicle_color <chr>, unregistered_vehicle <int>, vehicle_year <int>,
## #   meter_number <chr>, feet_from_curb <int>, violation_post_code <chr>,
## #   violation_description <chr>, no_standing_or_stopping_violation <chr>, hydrant_violation <chr>,
## #   double_parking_violation <chr>, issue_day <int>, issue_month <dbl>, issue_year <dbl>,
## #   issue_wday <ord>
```
]

---

## Ticket Frequency 

```{r, eval=FALSE}
nyc %>% 
  group_by(issue_date) %>% 
  summarize(n=n()) %>%
  ggplot(aes(x=issue_date, y=n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))
```

```{r echo=FALSE}
knitr::include_graphics("imgs/nyc_date_freq.png")
```


---


## Exercise 1

Some more dplyr practice,

1. Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket?

2. Which precinct issued the most tickets to Toyotas?

---
class: middle
count: false

# dbplyr

---

## Creating a database 

.small[
```{r eval=FALSE}
(db = dplyr::src_sqlite("/data/nyc_parking/nyc_parking_2014_cleaned.sqlite", create = TRUE))

##  src:  sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
##  tbls:
```
]

--

.small[
```{r eval=FALSE}
nyc_sql = dplyr::copy_to(db, nyc, temporary = FALSE)
db

## src:  sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## tbls: nyc, sqlite_stat1, sqlite_stat4
```
]

--

.small[
```r
pryr::object_size(db)
## 6.54 kB

pryr::object_size(nyc_sql)
## 9.54 kB
```

```shell
> ls -lah /data/nyc_parking/*.sqlite

-rw-r--r-- 1 cr173 visitor 698M Nov  5 13:57 /data/nyc_parking/nyc_parking_2014_cleaned.sqlite
```
]

---

.small[
```{r eval=FALSE}
nyc_sql = dplyr::tbl(db,"nyc")

class(nyc_sql)
## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl" 
```
]

--

.small[
```r
nyc_sql
## # Source:   table<nyc> [?? x 18]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
##    registration_state Plate_Type issue_date Violation_Code Vehicle_Body_Type Vehicle_Make issuing_agency
##                 <chr>      <chr>      <dbl>          <int>             <chr>        <chr>          <chr>
##  1                 NJ        PAS      15706             20              SUBN        CHRYS              P
##  2                 MI        PAS      15706             99               SDN        TOYOT              P
##  3                 NY        PAS      15706             20              SUBN         FORD              P
##  4                 NY        PAS      15706             20               SDN        INFIN              P
##  5                 NY        SRF      15706             20               SDN        NISSA              P
##  6                 NY        PAS      15706             46              SUBN        CHRYS              P
##  7                 NY        PAS      15706             62               SDN        HYUND              P
##  8                 NY        PAS      15706             98               SDN        TOYOT              P
##  9                 NY        PAS      15706             71               SDN        TOYOT              P
## 10                 NY        PAS      15706             71              SUBN         FORD              P
## # ... with more rows, and 11 more variables: violation_location <chr>, violation_precinct <int>,
## #   violation_time <chr>, house_number <chr>, street_name <chr>, intersecting_street <chr>,
## #   vehicle_color <chr>, issue_day <int>, issue_month <dbl>, issue_year <dbl>, issue_wday <chr>
```
]

--

.small[
```r
str(nyc_sql)
## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   .. .. ..@ ptr                :<externalptr> 
##   .. .. ..@ dbname             : chr "/data/nyc_parking/nyc_parking_2014_cleaned.sqlite"
##   .. .. ..@ loadable.extensions: logi TRUE
##   .. .. ..@ flags              : int 70
##   .. .. ..@ vfs                : chr ""
##   .. .. ..@ ref                :<environment: 0x55788ecb6f38> 
##   .. .. ..@ bigint             : chr "integer64"
##   ..$ disco:<environment: 0x557888dd9840> 
##   ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   : 'ident' chr "nyc"
##   ..$ vars: chr [1:18] "registration_state" "plate_type" "issue_date" "violation_code" ...
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
```
]

---

## Using dplyr with sqlite 

.small[
```{r, eval=FALSE}
(addr = nyc_sql %>%
        select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
        filter(violation_precinct >=1, violation_precinct <= 34)
)

## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
##    issue_date issuing_agency violation_precinct house_number street_name    
##         <dbl> <chr>                       <int> <chr>        <chr>          
##  1      15706 P                              28 102          W 123 ST       
##  2      15706 P                              23 2121         1 AVE          
##  3      15706 P                              23 60           E 106 ST       
##  4      15706 P                               5 54           ELIZABETH ST   
##  5      15706 P                              26 488-490      ST NICHOLAS AVE
##  6      15706 P                              26 1420         AMSTERDAM AVE  
##  7      15706 P                              25 219          E 121 ST       
##  8      15706 P                              12 630          LEXINGTON AVE  
##  9      15706 P                              18 413          48 TH ST       
## 10      15706 P                              25 2123         MADISON AVE    
## # ... with more rows
```
]

---

## SQL Query

.small[
```{r, eval=FALSE}
class(addr)

## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl" 

show_query(addr)

## <SQL>
## SELECT *
## FROM (SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`
## FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
```
]


---

## SQL Grouping 

```{r, eval=FALSE}
addr %>% group_by(issuing_agency, violation_precinct) %>% summarize(n=n())

##  # Source:   lazy query [?? x 3]
##  # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
##  # Groups:   issuing_agency
##     issuing_agency violation_precinct     n
##              <chr>              <int> <int>
##   1              A                  1    13
##   2              A                  7     1
##   3              A                 10    24
##   4              A                 11     1
##   5              A                 14    47
##   6              A                 33    11
##   7              B                 25     2
##   8              C                  5    73
##   9              C                 13     7
##  10              D                  1     1
##  # ... with more rows
```

---

## SQL Query

```{r, eval=FALSE}
addr %>% group_by(issuing_agency, violation_precinct) %>% summarize(n=n()) %>% show_query()

## <SQL>
## SELECT `issuing_agency`, `violation_precinct`, COUNT() AS `n`
## FROM (SELECT `issue_date` AS `issue_date`, `issuing_agency` AS `issuing_agency`, `violation_precinct` AS `## violation_precinct`, `house_number` AS `house_number`, `street_name` AS `street_name`
## FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
## GROUP BY `issuing_agency`, `violation_precinct`
```

---

## SQL Translation 

.small[
In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL.
dbplyr has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL. 

```{r, error=TRUE}
dbplyr::translate_sql(x == 1 & (y < 2 | z > 3))
dbplyr::translate_sql(x ^ 2 < 10)
dbplyr::translate_sql(x %% 2 == 10)
```

```{r error=TRUE}
dbplyr::translate_sql(paste(x,y))
dbplyr::translate_sql(mean(x))
```
]

---

.small[
```{r error=TRUE}
dbplyr::translate_sql(sd(x))
dbplyr::translate_sql(paste(x,y))
dbplyr::translate_sql(cumsum(x))
dbplyr::translate_sql(lag(x))
dbplyr::translate_sql(lm(y~x))
```
]


---

## Complications

.small[
```{r, eval=FALSE}
addr %>% mutate(address = paste(house_number, street_name))

## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
##    issue_date issuing_agency violation_precinct house_number street_name     address                
##         <dbl> <chr>                       <int> <chr>        <chr>           <chr>                  
##  1      15706 P                              28 102          W 123 ST        102 W 123 ST           
##  2      15706 P                              23 2121         1 AVE           2121 1 AVE             
##  3      15706 P                              23 60           E 106 ST        60 E 106 ST            
##  4      15706 P                               5 54           ELIZABETH ST    54 ELIZABETH ST        
##  5      15706 P                              26 488-490      ST NICHOLAS AVE 488-490 ST NICHOLAS AVE
##  6      15706 P                              26 1420         AMSTERDAM AVE   1420 AMSTERDAM AVE     
##  7      15706 P                              25 219          E 121 ST        219 E 121 ST           
##  8      15706 P                              12 630          LEXINGTON AVE   630 LEXINGTON AVE      
##  9      15706 P                              18 413          48 TH ST        413 48 TH ST           
## 10      15706 P                              25 2123         MADISON AVE     2123 MADISON AVE       
## # ... with more rows

addr %>% mutate(address = paste(house_number, street_name)) %>% show_query()

## <SQL>
## SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`, 
##        `house_number` || ' ' || ## `street_name` AS `address`
## FROM (SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`
##       FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
```
]

---

## (Unfair) Timings 

.small[
```{r, eval=FALSE}
system.time(
  nyc %>%
    select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
    filter(violation_precinct >=1, violation_precinct <= 34) %>%
    group_by(issuing_agency, violation_precinct) %>%
    summarize(n=n())
)

##   user  system elapsed 
##  0.639   0.099   0.740 
```
]

.small[
```{r, eval=FALSE}
system.time(
  nyc_sql %>%
    select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
    filter(violation_precinct >=1, violation_precinct <= 34) %>%
    group_by(issuing_agency, violation_precinct) %>%
    summarize(n=n())
)

##   user  system elapsed 
##  0.024   0.011   0.034 
```
]

--

`nyc_sql` was ~22x times faster than `nyc`, but the former is disk based while the latter is in memory, why this discrepancy?

---

## Laziness

dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

* When building a query, we don't want the entire table, often we want just enough to check if our query is working.

* Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.

* Therefore, by default `dplyr`
    
    * won't connect and query the database until absolutely necessary (e.g. show output),

    * and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like


---

.small[
```{r eval=FALSE}
nyc_sql %>%
  select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
  filter(violation_precinct >=1, violation_precinct <= 34) %>%
  group_by(issuing_agency, violation_precinct) %>%
  summarize(n=n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## # Groups:   issuing_agency
##    issuing_agency violation_precinct     n
##             <chr>              <int> <int>
##  1              A                  1    13
##  2              A                  7     1
##  3              A                 10    24
##  4              A                 11     1
##  5              A                 14    47
##  6              A                 33    11
##  7              B                 25     2
##  8              C                  5    73
##  9              C                 13     7
## 10              D                  1     1
## # ... with more rows
```
]

---

## Full query 
.small[

To force a full query *and* a return of the complete result it is necessart to use the `collect` function.

```{r, eval=FALSE}
system.time({
  nyc_sql %>%
    select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
    filter(violation_precinct >=1, violation_precinct <= 34) %>%
    group_by(issuing_agency, violation_precinct) %>%
    summarize(n=n()) %>%
    collect()
})

##   user  system elapsed 
##  5.915   0.507   6.445 

## # A tibble: 199 x 3
## # Groups:   issuing_agency [15]
##    issuing_agency violation_precinct     n
##             <chr>              <int> <int>
##  1              A                  1    13
##  2              A                  7     1
##  3              A                 10    24
##  4              A                 11     1
##  5              A                 14    47
##  6              A                 33    11
##  7              B                 25     2
##  8              C                  5    73
##  9              C                 13     7
## 10              D                  1     1
## # ... with 189 more rows
```


`compute` and `collapse` also force a full query but have slightly different behavior and return types. 

]

---

## Acknowledgments

Above materials are derived in part from the following sources:

* [dbplyr - Introduction Vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html)
* [dbplyr - SQL Translation](https://cran.r-project.org/web/packages/dbplyr/vignettes/sql-translation.html)