---
title: Bigish data
author: "Colin Rundel"
date: "2018-04-03"
output:
xaringan::moon_reader:
css: "slides.css"
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}
options(
htmltools.dir.version = FALSE, # for blogdown
width = 80,
tibble.width = 80
)
knitr::opts_chunk$set(
fig.align = "center"
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
library(dplyr)
```
---
class: middle
count: false
# Background
---
## Big vs Bigish data
* We will be working with data that is large, but will still fit in memory.
* R *loves* to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems.
* Less of an issue on saxon (256 GB of Ram), but this is a shared resource - use too much and your session will be killed.
* In general you should prefer:
.center[
*subsetting/vectorization >> apply > loops*
*built-in/base > user C/C++ functions > user R functions*
]
---
class: middle
count: false
# Reading Data
---
## Reading parking data
If we use the basic approach of `read.csv`, we end up waiting a really long time,
```{r, eval=FALSE}
system.time(read.csv("/data/nyc_parking/NYParkingViolations.csv"))
## user system elapsed
## 377.359 7.080 384.411
```
Over 6 minutes to read in a 1.7 gigabyte CSV file.
---
## Improvements
If we use `stringsAsFactors` and `comment.char` arguments we can speed things up a bit.
```{r, eval=FALSE}
system.time(
read.csv(
"/data/nyc_parking/NYParkingViolations.csv",
stringsAsFactors=FALSE,
comment.char=""
)
)
## user system elapsed
## 281.399 4.615 285.975
```
We can take this farther by specifying the structure of the data using the `colClasses` argument.
---
## Alternatives - data.table
```{r, eval=FALSE}
system.time({
nyc = data.table::fread("/data/nyc_parking/NYParkingViolations.csv")
})
## Read 9100278 rows and 43 (of 43) columns from 1.673 GB file in 00:00:52
## user system elapsed
## 50.855 0.970 51.793
class(nyc)
## "data.table" "data.frame"
nyc = as.data.frame(nyc)
class(nyc)
## [1] "data.frame"
```
---
## Alternatives - readr
```{r, eval=FALSE}
nyc = readr::read_csv("/data/nyc_parking/NYParkingViolations.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## |========================================| 100% 1713 MB
```
---
##
```{r, eval=FALSE}
## Warning: 654437 parsing failures.
## row col expected actual
## 2647 Violation Legal Code an integer T
## 3792 Violation Legal Code an integer T
## 4001 Violation Legal Code an integer T
## 4002 Violation Legal Code an integer T
## 4003 Violation Legal Code an integer T
## .... .................... .......... ......
## See problems(...) for more details.
##
## user system elapsed
## 103.196 6.792 108.993
```
---
## readr
This is a recent package that is designed to be a fast and friendly way of reading tabular data into R.
Core features:
* Faster than base R (~3-4x)
* No strings as factors
* No column name mangling
* Consistent argument/function naming scheme
* Plays nice with dplyr (`tbl_df`)
* Progress bars
---
```{r, eval=FALSE}
nyc
## Source: local data frame [9,100,278 x 43]
##
## Summons Number Plate ID Registration State Plate Type Issue_Date Violation Code
## (dbl) (chr) (chr) (chr) (chr) (int)
## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20
## 2 1366962000 63540MC NY COM 02/02/1971 46
## 3 1356906515 GFM1421 NY PAS 09/18/1971 40
## 4 1342296217 FYM5117 NY SRF 09/18/1971 21
## 5 1342296199 95V6675 TX PAS 09/18/1971 21
## 6 1342296187 GCY4187 NY SRF 09/18/1971 21
## 7 1337077380 18972BB NY 999 10/10/1971 14
## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14
## 9 1359914924 68091JZ NY COM 07/22/1973 46
## 10 1355498326 EWV4127 NY PAS 08/12/1973 21
## .. ... ... ... ... ... ...
## Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr),
## Street Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date
## (int), Violation Location (chr), Violation Precinct (int), Issuer Precinct (int),
## Issuer Code (int), Issuer Command (chr), Issuer Squad (chr), Violation Time (chr), Time
## First Observed (chr), Violation County (chr), Violation In Front Of Or Opposite (chr),
## House Number (chr), Street Name (chr), Intersecting Street (chr), Date First Observed
## (int), Law Section (int), Sub Division (chr), Violation Legal Code (int), Days Parking
## In Effect (chr), From Hours In Effect (chr), To Hours In Effect (chr), Vehicle Color
## (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter Number (chr), Feet From
## Curb (int), Violation Post Code (chr), Violation Description (chr), No Standing or
## Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)
```
---
## Fixing column names
```{r eval=FALSE}
library(stringr)
names(nyc) = str_replace_all(names(nyc), " ", "_")
nyc
## # A tibble: 9,100,278 x 43
## Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code
##
## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20
## 2 1366962000 63540MC NY COM 02/02/1971 46
## 3 1356906515 GFM1421 NY PAS 09/18/1971 40
## 4 1342296217 FYM5117 NY SRF 09/18/1971 21
## 5 1342296199 95V6675 TX PAS 09/18/1971 21
## 6 1342296187 GCY4187 NY SRF 09/18/1971 21
## 7 1337077380 18972BB NY 999 10/10/1971 14
## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14
## 9 1359914924 68091JZ NY COM 07/22/1973 46
## 10 1355498326 EWV4127 NY PAS 08/12/1973 21
## # ... with 9,100,268 more rows, and 37 more variables: Vehicle_Body_Type ,
## # Vehicle_Make , Issuing_Agency , Street_Code1 , Street_Code2 ,
## # Street_Code3 , Vehicle_Expiration_Date , Violation_Location ,
## # Violation_Precinct , Issuer_Precinct , Issuer_Code , Issuer_Command ,
## # Issuer_Squad , Violation_Time , Time_First_Observed ,
## # Violation_County , Violation_In_Front_Of_Or_Opposite , House_Number ,
## # Street_Name , Intersecting_Street , Date_First_Observed ,
## # Law_Section , Sub_Division , Violation_Legal_Code ,
## # Days_Parking_In_Effect , From_Hours_In_Effect , To_Hours_In_Effect ,
## # Vehicle_Color , `Unregistered_Vehicle?` , Vehicle_Year ,
## # Meter_Number , Feet_From_Curb , Violation_Post_Code ,
## # Violation_Description , No_Standing_or_Stopping_Violation ,
## # Hydrant_Violation , Double_Parking_Violation
```
---
## Simplifying
There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns.
```{r eval=FALSE}
(nyc %<>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
House_Number:Intersecting_Street, Vehicle_Color))
## # A tibble: 9,100,278 x 14
## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make
##
## 1 NY PAS 12/18/1970 20 SUBN GMC
## 2 NY COM 02/02/1971 46 DELV FRUEH
## 3 NY PAS 09/18/1971 40 SDN MAZDA
## 4 NY SRF 09/18/1971 21 SUBN NISSA
## 5 TX PAS 09/18/1971 21 GMC
## 6 NY SRF 09/18/1971 21 VAN FORD
## 7 NY 999 10/10/1971 14 BUS INTER
## 8 VA PAS 04/05/1973 14 SDN TOYOT
## 9 NY COM 07/22/1973 46 DELV TOYOT
## 10 NY PAS 08/12/1973 21 SUBN ACURA
## # ... with 9,100,268 more rows, and 8 more variables: Issuing_Agency ,
## # Violation_Location , Violation_Precinct , Violation_Time ,
## # House_Number , Street_Name , Intersecting_Street , Vehicle_Color
```
---
## Fixing Dates
```{r, eval=FALSE}
library(lubridate)
class(nyc$Issue_Date)
## [1] "character"
nyc %<>% mutate(Issue_Date = mdy(Issue_Date, tz="America/New_York"))
class(nyc$Issue_Date)
## [1] "Date"
```
---
##
```{r, eval=FALSE}
nyc
## # A tibble: 9,100,278 × 14
## Registration.State Plate.Type Issue_Date Violation.Code Vehicle.Body.Type Vehicle.Make
##
## 1 NY PAS 1970-12-18 20 SUBN GMC
## 2 NY COM 1971-02-02 46 DELV FRUEH
## 3 NY PAS 1971-09-18 40 SDN MAZDA
## 4 NY SRF 1971-09-18 21 SUBN NISSA
## 5 TX PAS 1971-09-18 21 GMC
## 6 NY SRF 1971-09-18 21 VAN FORD
## 7 NY 999 1971-10-10 14 BUS INTER
## 8 VA PAS 1973-04-05 14 SDN TOYOT
## 9 NY COM 1973-07-22 46 DELV TOYOT
## 10 NY PAS 1973-08-12 21 SUBN ACURA
## # ... with 9,100,268 more rows, and 8 more variables: Issuing.Agency ,
## # Violation.Location , Violation.Precinct , Violation.Time ,
## # House.Number , Street.Name , Intersecting.Street , Vehicle.Color
```
---
## More fixing dates
```{r, eval=FALSE}
range(nyc$Issue_Date)
## [1] "1970-12-18 EST" "2069-12-23 EST"
nyc$Issue_Date %>% year() %>% table()
## 1970 1971 1973 1974 1976 1977 1979 1981 1983 1984 1987
## 1 6 10 1 2 1 2 4 1 2 3
## 1990 1991 1996 2000 2001 2002 2003 2004 2005 2006 2007
## 2 1 1 319 91 7 39 77 9 11 13
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
## 8 9 129 251 618 4379109 4716512 1522 296 309 181
## 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029
## 329 18 26 1 31 23 10 4 4 7 3
## 2030 2031 2032 2033 2040 2041 2043 2044 2045 2046 2047
## 45 93 3 8 1 39 9 9 2 7 6
## 2048 2049 2050 2051 2052 2053 2060 2061 2063 2064 2066
## 1 3 1 12 2 1 3 10 9 5 3
## 2067 2069
## 2 1
```
---
##
```{r, eval=FALSE}
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 43
## Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type
##
## 1 1354300671 S28CMN NJ PAS 2013-01-01 20 SUBN
## 2 1349345910 XTX057 MI PAS 2013-01-01 99 SDN
## 3 1268869855 GJK5565 NY PAS 2013-01-01 20 SUBN
## 4 1268869843 EPS8803 NY PAS 2013-01-01 20 SDN
## 5 1365149122 FWZ5341 NY SRF 2013-01-01 20 SDN
## 6 1364348044 EPE8859 NY PAS 2013-01-01 46 SUBN
## 7 1364348032 ERT3706 NY PAS 2013-01-01 62 SDN
## 8 1364838760 DPA3951 NY PAS 2013-01-01 98 SDN
## 9 1364832835 ETS1289 NY PAS 2013-01-01 71 SDN
## 10 1364805819 FGE4351 NY PAS 2013-01-01 71 SUBN
## # ... with 9,095,611 more rows, and 36 more variables: Vehicle_Make , Issuing_Agency ,
## # Street_Code1 , Street_Code2 , Street_Code3 , Vehicle_Expiration_Date ,
## # Violation_Location , Violation_Precinct , Issuer_Precinct , Issuer_Code ,
## # Issuer_Command , Issuer_Squad , Violation_Time , Time_First_Observed ,
## # Violation_County , Violation_In_Front_Of_Or_Opposite , House_Number , Street_Name ,
## # Intersecting_Street , Date_First_Observed , Law_Section , Sub_Division ,
## # Violation_Legal_Code , Days_Parking_In_Effect , From_Hours_In_Effect ,
## # To_Hours_In_Effect , Vehicle_Color , `Unregistered_Vehicle?` , Vehicle_Year ,
## # Meter_Number , Feet_From_Curb , Violation_Post_Code , Violation_Description ,
## # No_Standing_or_Stopping_Violation , Hydrant_Violation , Double_Parking_Violation
```
---
## Performance?
```{r, eval=FALSE}
system.time(filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014")))
## user system elapsed
## 7.560 1.800 9.474
system.time(filter(nyc, year(Issue_Date) %in% c(2013,2014)))
## user system elapsed
## 7.003 1.761 8.794
```
---
## Putting it all together
```{r, eval=FALSE}
nyc = read_csv("/data/nyc_parking/NYParkingViolations.csv") %>%
setNames(str_replace_all(names(.)," ", "_")) %>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
House_Number:Intersecting_Street, Vehicle_Color) %>%
mutate(Issue_Date = mdy(Issue_Date)) %>%
mutate(Issue_Day = day(Issue_Date),
Issue_Month = month(Issue_Date),
Issue_Year = year(Issue_Date),
Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
filter(Issue_Year %in% 2013:2014)
```
---
##
```{r, eval=FALSE}
## # A tibble: 9,095,621 x 18
## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency
##
## 1 NJ PAS 2013-01-01 20 SUBN CHRYS P
## 2 MI PAS 2013-01-01 99 SDN TOYOT P
## 3 NY PAS 2013-01-01 20 SUBN FORD P
## 4 NY PAS 2013-01-01 20 SDN INFIN P
## 5 NY SRF 2013-01-01 20 SDN NISSA P
## 6 NY PAS 2013-01-01 46 SUBN CHRYS P
## 7 NY PAS 2013-01-01 62 SDN HYUND P
## 8 NY PAS 2013-01-01 98 SDN TOYOT P
## 9 NY PAS 2013-01-01 71 SDN TOYOT P
## 10 NY PAS 2013-01-01 71 SUBN FORD P
## # ... with 9,095,611 more rows, and 11 more variables: Violation_Location , Violation_Precinct ,
## # Violation_Time , House_Number , Street_Name , Intersecting_Street , Vehicle_Color ,
## # Issue_Day , Issue_Month , Issue_Year , Issue_WDay
```
---
## Ticket Frequency
```{r, eval=FALSE}
nyc %>%
group_by(Issue_Date) %>%
summarize(n=n()) %>%
ggplot(aes(x=Issue_Date, y=n)) +
geom_line() +
xlim(mdy("7/1/2013"), mdy("6/30/2014"))
```
```{r echo=FALSE}
knitr::include_graphics("imgs/nyc_date_freq.png")
```
---
## Exercise 1
Some more dplyr practice,
1. Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket?
2. Which precinct issued the most tickets to Toyotas?
3. How many different colors of cars were ticketed?
---
class: middle
count: false
# dbplyr
---
## Creating an sqlite database
.normal[
```{r eval=FALSE}
(db = src_sqlite("/data/nyc_parking/NYParkingViolations.sqlite", create = TRUE))
## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
## tbls:
nyc_sql = copy_to(db, nyc, temporary = FALSE)
db
## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
## tbls: nyc, sqlite_stat1
nyc_sql = tbl(db,"nyc")
str(nyc_sql)
## List of 9
## $ src :List of 3
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
## .. .. ..@ Id :
## .. .. ..@ dbname : chr "/data/nyc_parking/NYParkingViolations.sqlite"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 6
## .. .. ..@ vfs : chr ""
## ..$ path: chr "/data/nyc_parking/NYParkingViolations.sqlite"
## ..$ info:List of 2
## .. ..$ serverVersion: chr "3.8.6"
## .. ..$ results : logi FALSE
## ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
## $ from :Classes 'ident', 'sql', 'character' chr "nyc"
## ...
```
]
---
##
```{r eval=FALSE}
nyc_sql
## # Source: table [?? x 18]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency
##
## 1 NJ PAS 15706 20 SUBN CHRYS P
## 2 MI PAS 15706 99 SDN TOYOT P
## 3 NY PAS 15706 20 SUBN FORD P
## 4 NY PAS 15706 20 SDN INFIN P
## 5 NY SRF 15706 20 SDN NISSA P
## 6 NY PAS 15706 46 SUBN CHRYS P
## 7 NY PAS 15706 62 SDN HYUND P
## 8 NY PAS 15706 98 SDN TOYOT P
## 9 NY PAS 15706 71 SDN TOYOT P
## 10 NY PAS 15706 71 SUBN FORD P
## # ... with more rows, and 11 more variables: Violation_Location , Violation_Precinct ,
## # Violation_Time , House_Number , Street_Name , Intersecting_Street ,
## # Vehicle_Color , Issue_Day , Issue_Month , Issue_Year , Issue_WDay
```
---
## Using dplyr with sqlite
```{r, eval=FALSE}
(addr = nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34)
)
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precinct House_Number Street_Name
##
## 1 15706 P 28 102 W 123 ST
## 2 15706 P 23 2121 1 AVE
## 3 15706 P 23 60 E 106 ST
## 4 15706 P 5 54 ELIZABETH ST
## 5 15706 P 26 488-490 ST NICHOLAS AVE
## 6 15706 P 26 1420 AMSTERDAM AVE
## 7 15706 P 25 219 E 121 ST
## 8 15706 P 12 630 LEXINGTON AVE
## 9 15706 P 18 413 48 TH ST
## 10 15706 P 25 2123 MADISON AVE---
class: middle
count: false
# ... with more rows
```
---
## SQL Query
```{r, eval=FALSE}
class(addr)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
show_query(addr)
##
## SELECT *
## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
```
---
## Limitations
```{r, eval=FALSE}
addr %>% mutate(address = paste(House_Number, Street_Name))
## Error in rsqlite_send_query(conn@ptr, statement) :
## no such function: PASTE
addr %>% summarize(mean = mean(Violation_Precinct, na.rm=TRUE))
## na.rm not needed in SQL: NULL are always droppedFALSE
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## mean
##
## 1 16.09762
addr %>% summarize(mean = mean(Violation_Precinct))
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## mean
##
## 1 16.09762
```
---
## SQL Grouping
```{r, eval=FALSE}
addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
```
---
## SQL Query
```{r, eval=FALSE}
addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) %>% show_query()
##
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `## Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`
```
---
## SQL Translation
dbplyr has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL
.pull-left[
```{r, error=TRUE}
dbplyr::translate_sql(x == 1 & (y < 2 | z > 3))
dbplyr::translate_sql(x ^ 2 < 10)
dbplyr::translate_sql(x %% 2 == 10)
```
]
.pull-right[
```{r error=TRUE}
dbplyr::translate_sql(paste(x,y))
dbplyr::translate_sql(mean(x))
dbplyr::translate_sql(mean(x, na.rm=TRUE))
```
]
In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL.
---
## (Unfair) Timings
```{r, eval=FALSE}
system.time(
nyc %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n=n())
)
## user system elapsed
## 0.639 0.099 0.740
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n=n())
)
## user system elapsed
## 0.045 0.011 0.056
```
`nyc_sql` was 30x times faster than `nyc`, but the former is disk based while the latter is in memory, why this discrepancy?
---
## Laziness
dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
* When building a query, we don't want the entire table, often we want just enough to check if our query is working.
* Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
* Therefore, by default `dplyr`
* won't connect and query the database until absolutely necessary (e.g. show output),
* and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
---
##
```{r eval=FALSE}
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n=n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
```
---
## Full query
To force a full query and return a complete it is necessart to use the `collect` function.
```{r, eval=FALSE}
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n=n()) %>%
collect()
## # A tibble: 199 x 3
## # Groups: Issuing_Agency [15]
## Issuing_Agency Violation_Precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with 189 more rows
```
`compute` and `collapse` also force a full query but have slightly different behavior and return types.
---
## Creating Indexes
```{r, eval=FALSE}
(db_index = src_sqlite("/data/nyc_parking/NYParkingViolations_index.sqlite", create = TRUE))
## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations_index.sqlite]
## tbls:
nyc_index = copy_to(db_index, nyc, temporary = FALSE, overwrite = TRUE,
index = list("Violation_Precinct"))
```
---
The indexed database takes up more disk space:
```shell
cr173@saxon [~]$ ls -lh /data/nyc_parking/*.sqlite
-rwxr--r--. 1 cr173 visitor 1.8G Nov 6 12:18 /data/nyc_parking/NYParkingViolations_index.sqlite
-rwxr--r--. 1 cr173 visitor 1.7G Nov 6 12:01 /data/nyc_parking/NYParkingViolations.sqlite
```
---
## Timings for filtering precincts
```{r, eval=FALSE}
system.time(nyc_sql %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect())
## user system elapsed
## 22.204 1.524 23.832
system.time(nyc_index %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect())
## user system elapsed
## 21.547 1.647 23.284
```
---
## Timings for grouping
```{r, eval=FALSE}
system.time(nyc_sql %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect())
## user system elapsed
## 5.131 0.652 5.804
system.time(nyc_index %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect())
## user system elapsed
## 1.227 0.082 1.314
```
---
class: middle
count: false
# Acknowledgments
---
## Acknowledgments
Above materials are derived in part from the following sources:
* [dbplyr - Introduction Vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html)
* [dbplyr - SQL Translation](https://cran.r-project.org/web/packages/dbplyr/vignettes/sql-translation.html)