---
title: Bigish data
author: "Colin Rundel"
date: "2018-11-05"
output:
xaringan::moon_reader:
css: "slides.css"
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}
options(
htmltools.dir.version = FALSE, # for blogdown
width = 80,
tibble.width = 80
)
knitr::opts_chunk$set(
fig.align = "center"
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
library(dplyr)
```
---
class: middle
count: false
# Background
---
## Big vs Bigish data
* We will be working with data that is large, but will still fit in memory.
* R *loves* to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems.
* Less of an issue on saxon (256 GB of Ram), but this is a shared resource - use too much and your session will be killed.
* In general you should prefer:
.center[
*subsetting/vectorization >> apply > loops*
*built-in/base > user C/C++ functions > user R functions*
]
---
class: middle
count: false
# Reading Data
---
## Reading parking data
If we use the basic approach of `read.csv`, we end up waiting a really long time,
```{r, eval=FALSE}
system.time(read.csv("/data/nyc_parking/nyc_parking_2014.csv"))
## user system elapsed
## 377.359 7.080 384.411
```
Over 6 minutes to read in a 1.7 gigabyte CSV file.
---
## Improvements
If we use `stringsAsFactors` and `comment.char` arguments we can speed things up a bit.
```{r, eval=FALSE}
system.time(
read.csv(
"/data/nyc_parking/nyc_parking_2014.csv",
stringsAsFactors=FALSE,
comment.char=""
)
)
## user system elapsed
## 281.399 4.615 285.975
```
We can take this farther by specifying the structure of the data using the `colClasses` argument.
---
## Alternatives - data.table
.small[
```{r, eval=FALSE}
system.time({
nyc_fread = data.table::fread("/data/nyc_parking/nyc_parking_2014.csv")
})
## |--------------------------------------------------|
## |==================================================|
## user system elapsed
## 44.636 2.393 24.606
class(nyc_fread)
## "data.table" "data.frame"
nyc = as.data.frame(nyc_fread)
class(nyc)
## [1] "data.frame"
```
]
---
## Alternatives - readr
.small[
```{r, eval=FALSE}
system.time({
nyc = readr::read_csv("/data/nyc_parking/nyc_parking_2014.csv")
})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## |========================================| 100% 1713 MB
## user system elapsed
## 97.474 10.719 129.293
```
]
---
## Problems
.small[
```{r, eval=FALSE}
readr::problems(nyc)
## # A tibble: 654,437 x 5
## row col expected actual file
##
## 1 2647 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 2 3792 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 3 4001 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 4 4002 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 5 4003 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 6 4004 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 7 4005 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 8 4006 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 9 4019 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## 10 4059 Violation Legal Code an integer T '/data/nyc_parking/nyc_parking_2014.csv'
## # ... with 654,427 more rows
```
]
---
## readr
This is a recent package that is designed to be a fast and friendly way of reading tabular data into R.
Core features:
* Faster than base R (~3-4x)
* No strings as factors
* No column name mangling
* Consistent argument/function naming scheme
* Plays nice with dplyr (`tbl_df`)
* Progress bars
---
.small[
```{r, eval=FALSE}
nyc
## Source: local data frame [9,100,278 x 43]
##
## Summons Number Plate ID Registration State Plate Type issue_date Violation Code
## (dbl) (chr) (chr) (chr) (chr) (int)
## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20
## 2 1366962000 63540MC NY COM 02/02/1971 46
## 3 1356906515 GFM1421 NY PAS 09/18/1971 40
## 4 1342296217 FYM5117 NY SRF 09/18/1971 21
## 5 1342296199 95V6675 TX PAS 09/18/1971 21
## 6 1342296187 GCY4187 NY SRF 09/18/1971 21
## 7 1337077380 18972BB NY 999 10/10/1971 14
## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14
## 9 1359914924 68091JZ NY COM 07/22/1973 46
## 10 1355498326 EWV4127 NY PAS 08/12/1973 21
## .. ... ... ... ... ... ...
## Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr),
## Street Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date
## (int), Violation Location (chr), Violation Precinct (int), Issuer Precinct (int),
## Issuer Code (int), Issuer Command (chr), Issuer Squad (chr), Violation Time (chr), Time
## First Observed (chr), Violation County (chr), Violation In Front Of Or Opposite (chr),
## House Number (chr), Street Name (chr), Intersecting Street (chr), Date First Observed
## (int), Law Section (int), Sub Division (chr), Violation Legal Code (int), Days Parking
## In Effect (chr), From Hours In Effect (chr), To Hours In Effect (chr), Vehicle Color
## (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter Number (chr), Feet From
## Curb (int), Violation Post Code (chr), Violation Description (chr), No Standing or
## Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)
```
]
---
## Fixing column names
.small[
```{r eval=FALSE}
(nyc = janitor::clean_names(nyc))
## # A tibble: 9,100,278 x 43
## summons_number plate_id registration_st… plate_type issue_date violation_code
##
## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20
## 2 1366962000 63540MC NY COM 02/02/1971 46
## 3 1356906515 GFM1421 NY PAS 09/18/1971 40
## 4 1342296217 FYM5117 NY SRF 09/18/1971 21
## 5 1342296199 95V6675 TX PAS 09/18/1971 21
## 6 1342296187 GCY4187 NY SRF 09/18/1971 21
## 7 1337077380 18972BB NY 999 10/10/1971 14
## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14
## 9 1359914924 68091JZ NY COM 07/22/1973 46
## 10 1355498326 EWV4127 NY PAS 08/12/1973 21
## # ... with 9,100,268 more rows, and 37 more variables: vehicle_body_type ,
## # vehicle_make , issuing_agency , street_code1 , street_code2 ,
## # street_code3 , vehicle_expiration_date , violation_location ,
## # violation_precinct , issuer_precinct , issuer_code ,
## # issuer_command , issuer_squad , violation_time ,
## # time_first_observed , violation_county ,
## # violation_in_front_of_or_opposite , house_number , street_name ,
## # intersecting_street , date_first_observed , law_section ,
## # sub_division , violation_legal_code , days_parking_in_effect ,
## # from_hours_in_effect , to_hours_in_effect , vehicle_color ,
## # unregistered_vehicle , vehicle_year , meter_number ,
## # feet_from_curb , violation_post_code , violation_description ,
## # no_standing_or_stopping_violation , hydrant_violation ,
## # double_parking_violation
```
]
---
## Simplifying
There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns.
.small[
```{r eval=FALSE}
(nyc_trim = nyc %>%
select(registration_state:issuing_agency,
violation_location, violation_precinct, violation_time,
house_number:intersecting_street, vehicle_color))
## # A tibble: 9,100,278 x 14
## registration_st… plate_type issue_date violation_code vehicle_body_ty… vehicle_make
##
## 1 NY PAS 12/18/1970 20 SUBN GMC
## 2 NY COM 02/02/1971 46 DELV FRUEH
## 3 NY PAS 09/18/1971 40 SDN MAZDA
## 4 NY SRF 09/18/1971 21 SUBN NISSA
## 5 TX PAS 09/18/1971 21 NA GMC
## 6 NY SRF 09/18/1971 21 VAN FORD
## 7 NY 999 10/10/1971 14 BUS INTER
## 8 VA PAS 04/05/1973 14 SDN TOYOT
## 9 NY COM 07/22/1973 46 DELV TOYOT
## 10 NY PAS 08/12/1973 21 SUBN ACURA
## # ... with 9,100,268 more rows, and 8 more variables: issuing_agency ,
## # violation_location , violation_precinct , violation_time ,
## # house_number , street_name , intersecting_street ,
## # vehicle_color
```
]
---
## Object Sizes
```shell
cr173@gort [nyc_parking]$ ls -lah
total 1.7G
drwxr-xr-x 4 cr173 visitor 4.0K Nov 5 11:55 .
drwxrwxrwx 3 root root 4.0K Nov 5 12:47 ..
-rwxr--r-- 1 cr173 visitor 14K Nov 5 11:53 fine_definition.csv
drwxr-xr-x 2 cr173 visitor 4.0K Nov 5 11:53 nybb
-rwxr--r-- 1 cr173 visitor 1.7G Nov 5 11:53 nyc_parking_2014.csv
drwxrwxr-x 2 cr173 visitor 4.0K Nov 5 11:53 pluto_manhattan
```
```{r eval=FALSE}
pryr::object_size(nyc)
## 2.83 GB
pryr::object_size(nyc_fread)
## 2.69 GB
pryr::object_size(nyc_trim)
## 998 MB
```
---
## Clean data?
How many different car colors are in this data set?
--
.pull-left[ .small[
```{r eval=FALSE}
nyc %>%
count(vehicle_color) %>%
arrange(desc(n))
## # A tibble: 2,891 x 2
## vehicle_color n
##
## 1 WHITE 1348510
## 2 GY 1214213
## 3 WH 1192609
## 4 BK 941007
## 5 BLACK 665194
## 6 BL 442368
## 7 GREY 417142
## 8 SILVE 313770
## 9 BLUE 301119
## 10 RD 272772
## # ... with 2,881 more rows
```
] ]
.pull-right[ .small[
```{r eval=FALSE}
nyc %>%
count(vehicle_color)
## # A tibble: 2,891 x 2
## vehicle_color n
##
## 1 - 5
## 2 -- 2
## 3 --- 1
## 4 ---- 1
## 5 ----- 1
## 6 -. 1
## 7 , 1
## 8 ,.A 1
## 9 ,.J., 1
## 10 ,SILV 2
## # ... with 2,881 more rows
```
] ]
---
## Fixing Dates
```{r, eval=FALSE}
library(lubridate)
class(nyc$issue_date)
## [1] "character"
nyc = nyc %>% mutate(issue_date = mdy(issue_date, tz="America/New_York"))
class(nyc$issue_date)
## [1] "POSIXct" "POSIXt"
head(nyc$issue_date)
## [1] "1970-12-18 EST" "1971-02-02 EST" "1971-09-18 EDT" "1971-09-18 EDT" "1971-09-18 EDT" "1971-09-18 EDT"
```
---
.small[
```{r, eval=FALSE}
nyc
## # A tibble: 9,100,278 x 43
## summons_number plate_id registration_st… plate_type issue_date
##
## 1 1361929741 FCJ5493 NY PAS 1970-12-18 00:00:00
## 2 1366962000 63540MC NY COM 1971-02-02 00:00:00
## 3 1356906515 GFM1421 NY PAS 1971-09-18 00:00:00
## 4 1342296217 FYM5117 NY SRF 1971-09-18 00:00:00
## 5 1342296199 95V6675 TX PAS 1971-09-18 00:00:00
## 6 1342296187 GCY4187 NY SRF 1971-09-18 00:00:00
## 7 1337077380 18972BB NY 999 1971-10-10 00:00:00
## 8 1364523796 WNJ4730 VA PAS 1973-04-05 00:00:00
## 9 1359914924 68091JZ NY COM 1973-07-22 00:00:00
## 10 1355498326 EWV4127 NY PAS 1973-08-12 00:00:00
## # ... with 9,100,268 more rows, and 38 more variables: violation_code ,
## # vehicle_body_type , vehicle_make , issuing_agency ,
## # street_code1 , street_code2 , street_code3 ,
## # vehicle_expiration_date , violation_location ,
## # violation_precinct , issuer_precinct , issuer_code ,
## # issuer_command , issuer_squad , violation_time ,
## # time_first_observed , violation_county ,
## # violation_in_front_of_or_opposite , house_number ,
## # street_name , intersecting_street , date_first_observed ,
## # law_section , sub_division , violation_legal_code ,
## # days_parking_in_effect , from_hours_in_effect ,
## # to_hours_in_effect , vehicle_color , unregistered_vehicle ,
## # vehicle_year , meter_number , feet_from_curb ,
## # violation_post_code , violation_description ,
## # no_standing_or_stopping_violation , hydrant_violation ,
## # double_parking_violation
```
]
---
## More fixing dates
.small[
```{r, eval=FALSE}
range(nyc$issue_date)
## [1] "1970-12-18 EST" "2069-12-23 EST"
nyc$issue_date %>% year() %>% table()
## 1970 1971 1973 1974 1976 1977 1979 1981 1983 1984 1987
## 1 6 10 1 2 1 2 4 1 2 3
## 1990 1991 1996 2000 2001 2002 2003 2004 2005 2006 2007
## 2 1 1 319 91 7 39 77 9 11 13
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
## 8 9 129 251 618 4379109 4716512 1522 296 309 181
## 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029
## 329 18 26 1 31 23 10 4 4 7 3
## 2030 2031 2032 2033 2040 2041 2043 2044 2045 2046 2047
## 45 93 3 8 1 39 9 9 2 7 6
## 2048 2049 2050 2051 2052 2053 2060 2061 2063 2064 2066
## 1 3 1 12 2 1 3 10 9 5 3
## 2067 2069
## 2 1
```
]
---
.small[
```{r, eval=FALSE}
filter(nyc, issue_date >= mdy("1/1/2013"), issue_date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 43
## Summons_Number Plate_ID registration_state Plate_Type issue_date Violation_Code Vehicle_Body_Type
##
## 1 1354300671 S28CMN NJ PAS 2013-01-01 20 SUBN
## 2 1349345910 XTX057 MI PAS 2013-01-01 99 SDN
## 3 1268869855 GJK5565 NY PAS 2013-01-01 20 SUBN
## 4 1268869843 EPS8803 NY PAS 2013-01-01 20 SDN
## 5 1365149122 FWZ5341 NY SRF 2013-01-01 20 SDN
## 6 1364348044 EPE8859 NY PAS 2013-01-01 46 SUBN
## 7 1364348032 ERT3706 NY PAS 2013-01-01 62 SDN
## 8 1364838760 DPA3951 NY PAS 2013-01-01 98 SDN
## 9 1364832835 ETS1289 NY PAS 2013-01-01 71 SDN
## 10 1364805819 FGE4351 NY PAS 2013-01-01 71 SUBN
## # ... with 9,095,611 more rows, and 36 more variables: Vehicle_Make , issuing_agency ,
## # Street_Code1 , Street_Code2 , Street_Code3 , Vehicle_Expiration_Date ,
## # violation_location , violation_precinct , Issuer_Precinct , Issuer_Code ,
## # Issuer_Command , Issuer_Squad , violation_time , Time_First_Observed ,
## # Violation_County , Violation_In_Front_Of_Or_Opposite , house_number , street_name ,
## # intersecting_street , Date_First_Observed , Law_Section , Sub_Division ,
## # Violation_Legal_Code , Days_Parking_In_Effect , From_Hours_In_Effect ,
## # To_Hours_In_Effect , vehicle_color , `Unregistered_Vehicle?` , Vehicle_Year ,
## # Meter_Number , Feet_From_Curb , Violation_Post_Code , Violation_Description ,
## # No_Standing_or_Stopping_Violation , Hydrant_Violation , Double_Parking_Violation
```
]
---
## Performance?
```{r, eval=FALSE}
system.time(filter(nyc, issue_date >= mdy("1/1/2013"), issue_date <= mdy("12/31/2014")))
## user system elapsed
## 4.831 1.566 6.427
system.time(filter(nyc, year(issue_date) %in% c(2013,2014)))
## user system elapsed
## 6.864 1.952 8.879
```
---
## Putting it all together
.small[
```{r, eval=FALSE}
nyc = readr::read_csv("/data/nyc_parking/nyc_parking_2014.csv") %>%
janitor::clean_names() %>%
select(registration_state:issuing_agency,
violation_location, violation_precinct, violation_time,
number:intersecting_street, vehicle_color) %>%
mutate(issue_date = mdy(issue_date)) %>%
mutate(issue_day = day(issue_date),
issue_month = month(issue_date),
issue_year = year(issue_date),
issue_wday = wday(issue_date, label=TRUE)) %>%
filter(issue_year %in% 2013:2014)
nyc
## # A tibble: 9,095,621 x 47
## summons_number plate_id registration_st… plate_type issue_date violation_code
##
## 1 1354300671 S28CMN NJ PAS 2013-01-01 00:00:00 20
## 2 1349345910 XTX057 MI PAS 2013-01-01 00:00:00 99
## 3 1268869855 GJK5565 NY PAS 2013-01-01 00:00:00 20
## 4 1268869843 EPS8803 NY PAS 2013-01-01 00:00:00 20
## 5 1365149122 FWZ5341 NY SRF 2013-01-01 00:00:00 20
## 6 1364348044 EPE8859 NY PAS 2013-01-01 00:00:00 46
## 7 1364348032 ERT3706 NY PAS 2013-01-01 00:00:00 62
## 8 1364838760 DPA3951 NY PAS 2013-01-01 00:00:00 98
## 9 1364832835 ETS1289 NY PAS 2013-01-01 00:00:00 71
## 10 1364805819 FGE4351 NY PAS 2013-01-01 00:00:00 71
## # ... with 9,095,611 more rows, and 41 more variables: vehicle_body_type , vehicle_make ,
## # issuing_agency , street_code1 , street_code2 , street_code3 ,
## # vehicle_expiration_date , violation_location , violation_precinct ,
## # issuer_precinct , issuer_code , issuer_command , issuer_squad ,
## # violation_time , time_first_observed , violation_county ,
## # violation_in_front_of_or_opposite , house_number , street_name ,
## # intersecting_street , date_first_observed , law_section , sub_division ,
## # violation_legal_code , days_parking_in_effect , from_hours_in_effect ,
## # to_hours_in_effect , vehicle_color , unregistered_vehicle , vehicle_year ,
## # meter_number , feet_from_curb , violation_post_code ,
## # violation_description , no_standing_or_stopping_violation , hydrant_violation ,
## # double_parking_violation , issue_day , issue_month , issue_year ,
## # issue_wday
```
]
---
## Ticket Frequency
```{r, eval=FALSE}
nyc %>%
group_by(issue_date) %>%
summarize(n=n()) %>%
ggplot(aes(x=issue_date, y=n)) +
geom_line() +
xlim(mdy("7/1/2013"), mdy("6/30/2014"))
```
```{r echo=FALSE}
knitr::include_graphics("imgs/nyc_date_freq.png")
```
---
## Exercise 1
Some more dplyr practice,
1. Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket?
2. Which precinct issued the most tickets to Toyotas?
---
class: middle
count: false
# dbplyr
---
## Creating a database
.small[
```{r eval=FALSE}
(db = dplyr::src_sqlite("/data/nyc_parking/nyc_parking_2014_cleaned.sqlite", create = TRUE))
## src: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## tbls:
```
]
--
.small[
```{r eval=FALSE}
nyc_sql = dplyr::copy_to(db, nyc, temporary = FALSE)
db
## src: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## tbls: nyc, sqlite_stat1, sqlite_stat4
```
]
--
.small[
```r
pryr::object_size(db)
## 6.54 kB
pryr::object_size(nyc_sql)
## 9.54 kB
```
```shell
> ls -lah /data/nyc_parking/*.sqlite
-rw-r--r-- 1 cr173 visitor 698M Nov 5 13:57 /data/nyc_parking/nyc_parking_2014_cleaned.sqlite
```
]
---
.small[
```{r eval=FALSE}
nyc_sql = dplyr::tbl(db,"nyc")
class(nyc_sql)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
```
]
--
.small[
```r
nyc_sql
## # Source: table [?? x 18]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## registration_state Plate_Type issue_date Violation_Code Vehicle_Body_Type Vehicle_Make issuing_agency
##
## 1 NJ PAS 15706 20 SUBN CHRYS P
## 2 MI PAS 15706 99 SDN TOYOT P
## 3 NY PAS 15706 20 SUBN FORD P
## 4 NY PAS 15706 20 SDN INFIN P
## 5 NY SRF 15706 20 SDN NISSA P
## 6 NY PAS 15706 46 SUBN CHRYS P
## 7 NY PAS 15706 62 SDN HYUND P
## 8 NY PAS 15706 98 SDN TOYOT P
## 9 NY PAS 15706 71 SDN TOYOT P
## 10 NY PAS 15706 71 SUBN FORD P
## # ... with more rows, and 11 more variables: violation_location , violation_precinct ,
## # violation_time , house_number , street_name , intersecting_street ,
## # vehicle_color , issue_day , issue_month , issue_year , issue_wday
```
]
--
.small[
```r
str(nyc_sql)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## .. .. ..@ ptr :
## .. .. ..@ dbname : chr "/data/nyc_parking/nyc_parking_2014_cleaned.sqlite"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :
## .. .. ..@ bigint : chr "integer64"
## ..$ disco:
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "nyc"
## ..$ vars: chr [1:18] "registration_state" "plate_type" "issue_date" "violation_code" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
```
]
---
## Using dplyr with sqlite
.small[
```{r, eval=FALSE}
(addr = nyc_sql %>%
select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
filter(violation_precinct >=1, violation_precinct <= 34)
)
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## issue_date issuing_agency violation_precinct house_number street_name
##
## 1 15706 P 28 102 W 123 ST
## 2 15706 P 23 2121 1 AVE
## 3 15706 P 23 60 E 106 ST
## 4 15706 P 5 54 ELIZABETH ST
## 5 15706 P 26 488-490 ST NICHOLAS AVE
## 6 15706 P 26 1420 AMSTERDAM AVE
## 7 15706 P 25 219 E 121 ST
## 8 15706 P 12 630 LEXINGTON AVE
## 9 15706 P 18 413 48 TH ST
## 10 15706 P 25 2123 MADISON AVE
## # ... with more rows
```
]
---
## SQL Query
.small[
```{r, eval=FALSE}
class(addr)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
show_query(addr)
##
## SELECT *
## FROM (SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`
## FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
```
]
---
## SQL Grouping
```{r, eval=FALSE}
addr %>% group_by(issuing_agency, violation_precinct) %>% summarize(n=n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## # Groups: issuing_agency
## issuing_agency violation_precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
```
---
## SQL Query
```{r, eval=FALSE}
addr %>% group_by(issuing_agency, violation_precinct) %>% summarize(n=n()) %>% show_query()
##
## SELECT `issuing_agency`, `violation_precinct`, COUNT() AS `n`
## FROM (SELECT `issue_date` AS `issue_date`, `issuing_agency` AS `issuing_agency`, `violation_precinct` AS `## violation_precinct`, `house_number` AS `house_number`, `street_name` AS `street_name`
## FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
## GROUP BY `issuing_agency`, `violation_precinct`
```
---
## SQL Translation
.small[
In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL.
dbplyr has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL.
```{r, error=TRUE}
dbplyr::translate_sql(x == 1 & (y < 2 | z > 3))
dbplyr::translate_sql(x ^ 2 < 10)
dbplyr::translate_sql(x %% 2 == 10)
```
```{r error=TRUE}
dbplyr::translate_sql(paste(x,y))
dbplyr::translate_sql(mean(x))
```
]
---
.small[
```{r error=TRUE}
dbplyr::translate_sql(sd(x))
dbplyr::translate_sql(paste(x,y))
dbplyr::translate_sql(cumsum(x))
dbplyr::translate_sql(lag(x))
dbplyr::translate_sql(lm(y~x))
```
]
---
## Complications
.small[
```{r, eval=FALSE}
addr %>% mutate(address = paste(house_number, street_name))
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## issue_date issuing_agency violation_precinct house_number street_name address
##
## 1 15706 P 28 102 W 123 ST 102 W 123 ST
## 2 15706 P 23 2121 1 AVE 2121 1 AVE
## 3 15706 P 23 60 E 106 ST 60 E 106 ST
## 4 15706 P 5 54 ELIZABETH ST 54 ELIZABETH ST
## 5 15706 P 26 488-490 ST NICHOLAS AVE 488-490 ST NICHOLAS AVE
## 6 15706 P 26 1420 AMSTERDAM AVE 1420 AMSTERDAM AVE
## 7 15706 P 25 219 E 121 ST 219 E 121 ST
## 8 15706 P 12 630 LEXINGTON AVE 630 LEXINGTON AVE
## 9 15706 P 18 413 48 TH ST 413 48 TH ST
## 10 15706 P 25 2123 MADISON AVE 2123 MADISON AVE
## # ... with more rows
addr %>% mutate(address = paste(house_number, street_name)) %>% show_query()
##
## SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`,
## `house_number` || ' ' || ## `street_name` AS `address`
## FROM (SELECT `issue_date`, `issuing_agency`, `violation_precinct`, `house_number`, `street_name`
## FROM `nyc`)
## WHERE ((`violation_precinct` >= 1.0) AND (`violation_precinct` <= 34.0))
```
]
---
## (Unfair) Timings
.small[
```{r, eval=FALSE}
system.time(
nyc %>%
select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
filter(violation_precinct >=1, violation_precinct <= 34) %>%
group_by(issuing_agency, violation_precinct) %>%
summarize(n=n())
)
## user system elapsed
## 0.639 0.099 0.740
```
]
.small[
```{r, eval=FALSE}
system.time(
nyc_sql %>%
select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
filter(violation_precinct >=1, violation_precinct <= 34) %>%
group_by(issuing_agency, violation_precinct) %>%
summarize(n=n())
)
## user system elapsed
## 0.024 0.011 0.034
```
]
--
`nyc_sql` was ~22x times faster than `nyc`, but the former is disk based while the latter is in memory, why this discrepancy?
---
## Laziness
dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
* When building a query, we don't want the entire table, often we want just enough to check if our query is working.
* Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
* Therefore, by default `dplyr`
* won't connect and query the database until absolutely necessary (e.g. show output),
* and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
---
.small[
```{r eval=FALSE}
nyc_sql %>%
select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
filter(violation_precinct >=1, violation_precinct <= 34) %>%
group_by(issuing_agency, violation_precinct) %>%
summarize(n=n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0 [/data/nyc_parking/nyc_parking_2014_cleaned.sqlite]
## # Groups: issuing_agency
## issuing_agency violation_precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
```
]
---
## Full query
.small[
To force a full query *and* a return of the complete result it is necessart to use the `collect` function.
```{r, eval=FALSE}
system.time({
nyc_sql %>%
select(issue_date, issuing_agency, violation_precinct, house_number, street_name) %>%
filter(violation_precinct >=1, violation_precinct <= 34) %>%
group_by(issuing_agency, violation_precinct) %>%
summarize(n=n()) %>%
collect()
})
## user system elapsed
## 5.915 0.507 6.445
## # A tibble: 199 x 3
## # Groups: issuing_agency [15]
## issuing_agency violation_precinct n
##
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with 189 more rows
```
`compute` and `collapse` also force a full query but have slightly different behavior and return types.
]
---
## Acknowledgments
Above materials are derived in part from the following sources:
* [dbplyr - Introduction Vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html)
* [dbplyr - SQL Translation](https://cran.r-project.org/web/packages/dbplyr/vignettes/sql-translation.html)