dplyr and SQL


The why of databases

Numbers everyone should know

Task                                        Timing
L1 cache reference                             0.5 ns
L2 cache reference                               7 ns
Main memory reference                          100 ns
Read 1 MB sequentially from memory         250,000 ns
Disk seek                               10,000,000 ns
Read 1 MB sequentially from network     10,000,000 ns
Read 1 MB sequentially from disk        30,000,000 ns
From: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/

Implications for biggish data

Let's imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criterion. This requires a sequential read across the entire data set.

If we can store the file in memory:

  • 10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from disk:

  • 10 GB × (30 ms/1 MB) = 300 seconds


This is just for reading data; if we make any modifications (writing), things are much worse.

Implications for big data

What about a 100 GB flat data file?

If we can store the file in memory:

  • 100 GB × (250 μs/1 MB) = 25 seconds

If we have to access the file from disk:

  • 100 GB × (30 ms/1 MB) = 3,000 seconds


This is actually incredibly optimistic since it assumes that all the data on disk can be read in one continuous read.
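
To make the arithmetic explicit, here is a small back-of-the-envelope calculation in R using the per-MB costs from the table above (assuming 1 GB = 1,000 MB):

10 * 1000 * 250e-6   # ~2.5 s to scan 10 GB sequentially from memory
10 * 1000 * 30e-3    # ~300 s (5 minutes) to scan 10 GB from disk
100 * 1000 * 250e-6  # ~25 s to scan 100 GB from memory
100 * 1000 * 30e-3   # ~3,000 s (50 minutes) to scan 100 GB from disk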

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory


So it is usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource. What can we do if we can't fit everything into memory?


Create blocks: group rows based on similar attributes and read multiple rows at a time. The optimal block size will depend on the task and the properties of the disk.
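
As a minimal sketch of reading in blocks with base R (the file path, block size, and Age column here are purely illustrative, and the real question of how rows are laid out on disk is much deeper than this):

# Process a large CSV a block of 10,000 rows at a time instead of all at once
con = file("big_file.csv", open = "r")
header = strsplit(readLines(con, n = 1), ",")[[1]]     # column names from the first line

keep = list()
repeat {
  lines = readLines(con, n = 10000)                    # read the next block of rows
  if (length(lines) == 0) break                        # stop at end of file
  block = read.csv(text = lines, header = FALSE, col.names = header,
                   stringsAsFactors = FALSE)
  keep[[length(keep) + 1]] = block[block$Age >= 22, ]  # keep only the rows we want
}
close(con)

result = do.call(rbind, keep)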

Binary Search Example

Age Name
19 Carol
20 Greg
21 Alice
21 Dave
22 Eve
23 Bob
23 Frank



Let's search for the records of people who are 22 or older.
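
Because the records are sorted by Age, a binary search can jump to the first qualifying row instead of scanning all of them. A minimal sketch in R (the vectors and helper function are illustrative, matching the toy table above):

age  = c(19, 20, 21, 21, 22, 23, 23)
name = c("Carol", "Greg", "Alice", "Dave", "Eve", "Bob", "Frank")

# Binary search: smallest index i such that age[i] >= target (age must be sorted)
first_at_least = function(age, target) {
  lo = 1; hi = length(age) + 1
  while (lo < hi) {
    mid = (lo + hi) %/% 2
    if (age[mid] >= target) hi = mid else lo = mid + 1
  }
  lo
}

i = first_at_least(age, 22)
name[i:length(name)]   # "Eve" "Bob" "Frank" -- everyone 22 or older

Only about log2(n) comparisons are needed to find the starting row, rather than looking at every record.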

and then?

This is just barely scratching the surface:

  • Efficiency gains are not just for disk; access is access.

  • In general, there is a trade-off between storage and efficiency.

  • Reality is a lot more complicated for everything mentioned so far; lots of very smart people have spent a lot of time thinking about and implementing these tools.

  • Different tasks with different requirements call for different implementations and have different criteria for optimization.

SQL

Structured Query Language is a special-purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI standard, but with some dialect divergence

  • This functionality maps very closely (but not exactly) onto the data manipulation verbs present in dplyr.

  • We will see this mapping in more detail in a bit.
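
As a rough preview of that mapping, here is a tiny self-contained sketch of the verb-to-clause correspondence using a throwaway SQLite database (all names below are illustrative, and the exact SQL dplyr generates will differ in quoting and detail):

library(dplyr)   # loaded in the next section as well

db_tmp = src_sqlite(tempfile(), create = TRUE)
d      = copy_to(db_tmp, data.frame(x = 1:5, y = 5:1), name = "d")

q = d %>%
    select(x, y) %>%     # SELECT x, y
    filter(x > 2) %>%    # WHERE x > 2
    arrange(y)           # ORDER BY y

q$query                  # show the SQL that dplyr builds for this pipeline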

Back to dplyr

dplyr & sqlite

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## Loading required package: methods
library(stringr)

park = read.csv("/home/vis/cr173/Sta523/data/parking/NYParkingViolations_small.csv",
                stringsAsFactors=FALSE) %>% 
       as.data.frame() %>%
       tbl_df()

dir.create("~/db",showWarnings = FALSE)
db = src_sqlite("~/db/park_db.sqlite3", create = TRUE)
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
dir("~/db/")
## [1] "park_db.sqlite3"              "park_full_db.sqlite3"         "park_full_indexed_db.sqlite3"

park_sqlite = copy_to(db, park, temporary = FALSE)
park_sqlite
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [91,003 x 43]
## 
##    Summons.Number Plate.ID Registration.State Plate.Type Issue.Date Violation.Code Vehicle.Body.Type
## 1       1.359e+09  FXX9781                 NY        PAS 02/20/2014             20              SUBN
## 2       7.486e+09  FLZ6021                 NY        PAS 08/12/2013             37              4DSD
## 3       1.354e+09  53902MB                 NY        COM 10/24/2013             14               VAN
## 4       1.342e+09  FYM2426                 NY        PAS 09/16/2013             21               SDN
## 5       1.372e+09  GPV3714                 NY        PAS 06/10/2014             71              SUBN
## 6       1.362e+09   XBAV11                 NJ        PAS 12/27/2013             78               VAN
## 7       7.004e+09  59305JY                 NY        COM 09/18/2013             38              DELV
## 8       1.362e+09  2146518                 IN        PAS 11/22/2013             41               VAN
## 9       7.541e+09  49138M2                 MD        PAS 12/18/2013             38              SUBN
## 10      7.311e+09  GHD5283                 NY        PAS 09/24/2013             46              4DSD
## ..            ...      ...                ...        ...        ...            ...               ...
## Variables not shown: Vehicle.Make (chr), Issuing.Agency (chr), Street.Code1 (int), Street.Code2 (int),
##   Street.Code3 (int), Vehicle.Expiration.Date (int), Violation.Location (int), Violation.Precinct (int),
##   Issuer.Precinct (int), Issuer.Code (int), Issuer.Command (chr), Issuer.Squad (chr), Violation.Time (chr),
##   Time.First.Observed (chr), Violation.County (chr), Violation.In.Front.Of.Or.Opposite (chr), House.Number
##   (chr), Street.Name (chr), Intersecting.Street (chr), Date.First.Observed (int), Law.Section (int),
##   Sub.Division (chr), Violation.Legal.Code (chr), Days.Parking.In.Effect.... (chr), From.Hours.In.Effect
##   (chr), To.Hours.In.Effect (chr), Vehicle.Color (chr), Unregistered.Vehicle. (int), Vehicle.Year (int),
##   Meter.Number (chr), Feet.From.Curb (int), Violation.Post.Code (chr), Violation.Description (chr),
##   No.Standing.or.Stopping.Violation (int), Hydrant.Violation (int), Double.Parking.Violation (int)

Limitations

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        mutate(House.Number = str_trim(House.Number), Street.Name = str_trim(Street.Name)) %>%
        filter(House.Number != "" & Street.Name != "") %>%
        filter(str_detect(House.Number,"[0-9]+")) %>%
        transmute(Violation.Precinct = Violation.Precinct, addr = paste(House.Number, Street.Name)) %>%
        mutate(addr = tolower(addr)))
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## Error: RS-DBI driver: (error in statement: no such function: PASTE)

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        mutate(House.Number = trim(House.Number), Street.Name = trim(Street.Name)) %>%
        filter(House.Number != "" & Street.Name != "") 
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,283 x 5]
## Filter: Violation.Precinct <= 34, House.Number != "" & Street.Name != "" 
## 
##    Violation.Precinct House.Number   Street.Name House.Number.1 Street.Name.1
## 1                   0          840         8 AVE            840         8 AVE
## 2                  17          643 Lexington Ave            643 Lexington Ave
## 3                  17          305     E 47th St            305     E 47th St
## 4                  14            N     W 43rd St              N     W 43rd St
## 5                  24           39    W 105th St             39    W 105th St
## 6                  18          220     W 49th St            220     W 49th St
## 7                  18            6     W 48th St              6     W 48th St
## 8                  14          205     W 39th St            205     W 39th St
## 9                  10          255     W 14th St            255     W 14th St
## 10                 24            S     W 97th St              S     W 97th St
## ..                ...          ...           ...            ...           ...

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        filter(trim(House.Number) != "" & trim(House.Number) != "") 
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,288 x 3]
## Filter: Violation.Precinct <= 34, trim(House.Number) != "" & trim(House.Number) != "" 
## 
##    Violation.Precinct House.Number   Street.Name
## 1                   0          840         8 AVE
## 2                  17          643 Lexington Ave
## 3                  17          305     E 47th St
## 4                  14            N     W 43rd St
## 5                  24           39    W 105th St
## 6                  18          220     W 49th St
## 7                  18            6     W 48th St
## 8                  14          205     W 39th St
## 9                  10          255     W 14th St
## 10                 24            S     W 97th St
## ..                ...          ...           ...

SQL Query

addr$query
## <Query> SELECT "Violation.Precinct" AS "Violation.Precinct", "House.Number" AS "House.Number", "Street.Name" AS "Street.Name"
## FROM "park"
## WHERE "Violation.Precinct" <= 34.0 AND TRIM("House.Number") != '' AND TRIM("House.Number") != ''
## <SQLiteConnection: DBI CON (28575, 0)>
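
Going the other direction, if you already know the SQL you want you can hand it to dplyr directly and keep working with the result as a lazy table. A minimal sketch, reusing the db source created above:

raw = tbl(db, sql('SELECT "Violation.Precinct", "House.Number", "Street.Name"
                   FROM park
                   WHERE "Violation.Precinct" <= 34'))
raw   # still a lazy tbl; further verbs and collect() work as usual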

SQL Translation

dplyr has a function, translate_sql, that lets you experiment with how R functions are translated to SQL.

translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x,y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## <SQL> AVG("x")
translate_sql(mean(x, na.rm=TRUE))
## Error: na.rm not needed in SQL: NULL are always dropped


In general, dplyr knows how to translate basic math, logical, and summary functions from R to SQL.
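
When a function has no SQL translation (like paste above), one workaround (a sketch, not the only option) is to push as much filtering as possible into the database, collect the result, and then apply the R-only functions locally:

# Filter in SQLite, then finish the string work in R after collect()
addr_local = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
             filter(Violation.Precinct <= 34) %>%
             collect() %>%
             mutate(House.Number = str_trim(House.Number),
                    Street.Name  = str_trim(Street.Name)) %>%
             filter(House.Number != "", Street.Name != "") %>%
             mutate(addr = tolower(paste(House.Number, Street.Name)))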

(Unfair) Timings

system.time(
  select(park, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(str_trim(House.Number) != "" & str_trim(House.Number) != "") 
)
##    user  system elapsed 
##   0.061   0.001   0.062
system.time(
  select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(trim(House.Number) != "" & trim(House.Number) != "") 
)
##    user  system elapsed 
##   0.018   0.000   0.019

park_sqlite was 3.4x faster than park, but the former is disk based while the latter is wholly in memory. Why this discrepancy?

Laziness

dplyr uses lazy evaluation as much as possible, particularly when working with SQL backends.

  • When building a query, we don’t want the entire table; often we just need enough rows to check that the query is working.

  • Since we would prefer to run one complex query over many simple queries, laziness allows verbs to be strung together.

  • Therefore, by default dplyr

    • won’t connect to and query the database until absolutely necessary (e.g. to show output),

    • will only query a handful of rows unless explicitly told otherwise (see the sketch below).
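
As a concrete sketch of what this means for the timings above: adding collect() to the SQLite pipeline forces the full query to actually run, which gives a fairer (and slower) number than the lazy version timed earlier.

system.time(
  select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(trim(House.Number) != "" & trim(House.Number) != "") %>%
  collect()
)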

Full query

To force a full query and return a complete tbl_df object, dplyr uses the collect function.

collect(addr)
## Source: local data frame [35,288 x 3]
## 
##    Violation.Precinct House.Number   Street.Name
## 1                   0          840         8 AVE
## 2                  17          643 Lexington Ave
## 3                  17          305     E 47th St
## 4                  14            N     W 43rd St
## 5                  24           39    W 105th St
## 6                  18          220     W 49th St
## 7                  18            6     W 48th St
## 8                  14          205     W 39th St
## 9                  10          255     W 14th St
## 10                 24            S     W 97th St
## ..                ...          ...           ...

compute and collapse also force the query to be built (and, for compute, run) but have slightly different behavior and return types.
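
A minimal sketch of the differences (behavior may vary slightly across dplyr versions):

addr_local  = collect(addr)    # run the query, return a local tbl_df in R
addr_dbtbl  = compute(addr)    # run the query, store the result in a temporary
                               #   table in the database, return a tbl pointing at it
addr_nested = collapse(addr)   # don't hit the database; wrap the current query in a
                               #   subquery so later verbs build on top of it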

Creating Indexes

First some data loading and cleaning

library(data.table)
library(bit64)

park_full = fread("/home/vis/cr173/Sta523/data/parking/NYParkingViolations.csv") %>% 
            data.frame() %>%
            tbl_df() %>%
            select(Plate.ID:Violation.Code, Violation.Precinct, House.Number:Intersecting.Street) %>%
            mutate(Issue.Date = mdy(Issue.Date)) %>%
            mutate(Year = year(Issue.Date), Month = month(Issue.Date), Day = day(Issue.Date)) %>%
            select(-Issue.Date)

full = src_sqlite("~/db/park_full_db.sqlite3", create = TRUE) %>%
       copy_to(park_full, temporary = FALSE)

full_index = src_sqlite("~/db/park_full_indexed_db.sqlite3", create = TRUE) %>%
             copy_to(park_full, temporary = FALSE,
                     index = list(c("Year", "Month", "Day"), "Violation.Precinct", "Violation.Code"))

The indexed database takes up more disk space:

cr173@saxon [Sta523_Fa14]$ ls -lh ~/db
total 740M
-rw-r--r--+ 1 cr173 visitor    0 Oct 19 22:24 park_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 483M Oct 20 11:59 park_full_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 792M Oct 20 12:04 park_full_indexed_db.sqlite3
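
The extra space is the indexes themselves. To see what was actually created, one option (a sketch, going through DBI directly rather than dplyr) is to query SQLite's sqlite_master table:

library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbGetQuery(con, "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
dbDisconnect(con)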

Timings for filtering bad dates

system.time(full %>% filter( !(Year %in% 2013:2014) ) %>% collect)
##    user  system elapsed 
##   3.272   0.466   3.747
system.time(full_index %>% filter( !(Year %in% 2013:2014) ) %>% collect)
##    user  system elapsed 
##   3.261   0.499   3.765

Timings for grouping

system.time(full %>% filter( Year %in% 2013:2014 ) %>% 
            group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%  
            collect)
##    user  system elapsed 
##  41.238   2.097  43.347
system.time(full_index %>% filter( Year %in% 2013:2014 ) %>% 
            group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%  
            collect)
##    user  system elapsed 
##  40.488   2.175  42.659
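
The indexes don't appear to help for either query. One way to investigate (a sketch, again via DBI as in the earlier sketch, with the table name matching the copy_to call above) is to ask SQLite for its query plan and check whether an index is actually being used:

con = dbConnect(SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbGetQuery(con, '
  EXPLAIN QUERY PLAN
  SELECT "Year", "Month", "Violation.Precinct", COUNT(*) AS count
  FROM park_full
  WHERE "Year" IN (2013, 2014)
  GROUP BY "Year", "Month", "Violation.Precinct"')
dbDisconnect(con)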

Acknowledgments

Above materials are derived in part from the following sources: