dplyr and SQL


The why of databases

Numbers everyone should know

Task                                        Timing
L1 cache reference                             0.5 ns
L2 cache reference                               7 ns
Main memory reference                          100 ns
Read 1 MB sequentially from memory         250,000 ns
Disk seek                               10,000,000 ns
Read 1 MB sequentially from network     10,000,000 ns
Read 1 MB sequentially from disk        30,000,000 ns
From: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/

Implications for biggish data

Let's imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criterion. This requires a sequential read across the entire data set.

If we can store the file in memory:

  • 10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from disk:

  • 10 GB × (30 ms/1 MB) = 300 seconds


This is just for reading data; if we make any modifications (writing), things are much worse.

Implications for big data

What about a 100 GB flat data file?

If we can store the file in memory:

  • 100 GB × (250 μs/1 MB) = 25 seconds

If we have to access the file from disk:

  • 100 GB × (30 ms/1 MB) = 3,000 seconds


This is actually incredibly optimistic since it assumes that all the data on disk can be read in one continuous read.
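
To make the arithmetic explicit, here is a small back-of-the-envelope calculation in R using the per-MB costs from the table above (assuming 1 GB = 1,000 MB):

10 * 1000 * 250e-6   # ~2.5 s to scan 10 GB sequentially from memory
10 * 1000 * 30e-3    # ~300 s (5 minutes) to scan 10 GB from disk
100 * 1000 * 250e-6  # ~25 s to scan 100 GB from memory
100 * 1000 * 30e-3   # ~3,000 s (50 minutes) to scan 100 GB from disk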

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory


So it is usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource. What can we do if we can't fit everything into memory?


Create blocks: group rows based on similar attributes and read multiple rows at a time. The optimal block size will depend on the task and the properties of the disk.
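
As a minimal sketch of reading in blocks with base R (the file path, block size, and Age column here are purely illustrative, and the real question of how rows are laid out on disk is much deeper than this):

# Process a large CSV a block of 10,000 rows at a time instead of all at once
con = file("big_file.csv", open = "r")
header = strsplit(readLines(con, n = 1), ",")[[1]]     # column names from the first line

keep = list()
repeat {
  lines = readLines(con, n = 10000)                    # read the next block of rows
  if (length(lines) == 0) break                        # stop at end of file
  block = read.csv(text = lines, header = FALSE, col.names = header,
                   stringsAsFactors = FALSE)
  keep[[length(keep) + 1]] = block[block$Age >= 22, ]  # keep only the rows we want
}
close(con)

result = do.call(rbind, keep)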

Binary Search Example

Age Name
19 Carol
20 Greg
21 Alice
21 Dave
22 Eve
23 Bob
23 Frank



Let's search for the records of people who are 22 or older.
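
Because the records are sorted by Age, a binary search can jump to the first qualifying row instead of scanning all of them. A minimal sketch in R (the vectors and helper function are illustrative, matching the toy table above):

age  = c(19, 20, 21, 21, 22, 23, 23)
name = c("Carol", "Greg", "Alice", "Dave", "Eve", "Bob", "Frank")

# Binary search: smallest index i such that age[i] >= target (age must be sorted)
first_at_least = function(age, target) {
  lo = 1; hi = length(age) + 1
  while (lo < hi) {
    mid = (lo + hi) %/% 2
    if (age[mid] >= target) hi = mid else lo = mid + 1
  }
  lo
}

i = first_at_least(age, 22)
name[i:length(name)]   # "Eve" "Bob" "Frank" -- everyone 22 or older

Only about log2(n) comparisons are needed to find the starting row, rather than looking at every record.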

and then?

This is just barely scratching the surface:

  • Efficiency gains are not just for disk; access is access.

  • In general, there is a trade-off between storage and efficiency.

  • Reality is a lot more complicated for everything mentioned so far; lots of very smart people have spent a lot of time thinking about and implementing these tools.

  • Different tasks with different requirements call for different implementations and have different criteria for optimization.

SQL

Structured Query Language is a special-purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI standard, but with some dialect divergence

  • This functionality maps very closely (but not exactly) onto the data manipulation verbs present in dplyr.

  • We will see this mapping in more detail in a bit.
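
As a rough preview of that mapping, here is a tiny self-contained sketch of the verb-to-clause correspondence using a throwaway SQLite database (all names below are illustrative, and the exact SQL dplyr generates will differ in quoting and detail):

library(dplyr)   # loaded in the next section as well

db_tmp = src_sqlite(tempfile(), create = TRUE)
d      = copy_to(db_tmp, data.frame(x = 1:5, y = 5:1), name = "d")

q = d %>%
    select(x, y) %>%     # SELECT x, y
    filter(x > 2) %>%    # WHERE x > 2
    arrange(y)           # ORDER BY y

q$query                  # show the SQL that dplyr builds for this pipeline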

Back to dplyr

dplyr & sqlite

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## Loading required package: methods
library(stringr)

park = read.csv("/home/vis/cr173/Sta523/data/parking/NYParkingViolations_small.csv",
                stringsAsFactors=FALSE) %>% 
       as.data.frame() %>%
       tbl_df()

dir.create("~/db",showWarnings = FALSE)
db = src_sqlite("~/db/park_db.sqlite3", create = TRUE)
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
dir("~/db/")
## [1] "park_db.sqlite3"              "park_full_db.sqlite3"         "park_full_indexed_db.sqlite3"

park_sqlite = copy_to(db, park, temporary = FALSE)
park_sqlite
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [91,003 x 43]
## 
##    Summons.Number Plate.ID Registration.State Plate.Type Issue.Date Violation.Code Vehicle.Body.Type
## 1       1.359e+09  FXX9781                 NY        PAS 02/20/2014             20              SUBN
## 2       7.486e+09  FLZ6021                 NY        PAS 08/12/2013             37              4DSD
## 3       1.354e+09  53902MB                 NY        COM 10/24/2013             14               VAN
## 4       1.342e+09  FYM2426                 NY        PAS 09/16/2013             21               SDN
## 5       1.372e+09  GPV3714                 NY        PAS 06/10/2014             71              SUBN
## 6       1.362e+09   XBAV11                 NJ        PAS 12/27/2013             78               VAN
## 7       7.004e+09  59305JY                 NY        COM 09/18/2013             38              DELV
## 8       1.362e+09  2146518                 IN        PAS 11/22/2013             41               VAN
## 9       7.541e+09  49138M2                 MD        PAS 12/18/2013             38              SUBN
## 10      7.311e+09  GHD5283                 NY        PAS 09/24/2013             46              4DSD
## ..            ...      ...                ...        ...        ...            ...               ...
## Variables not shown: Vehicle.Make (chr), Issuing.Agency (chr), Street.Code1 (int), Street.Code2 (int),
##   Street.Code3 (int), Vehicle.Expiration.Date (int), Violation.Location (int), Violation.Precinct (int),
##   Issuer.Precinct (int), Issuer.Code (int), Issuer.Command (chr), Issuer.Squad (chr), Violation.Time (chr),
##   Time.First.Observed (chr), Violation.County (chr), Violation.In.Front.Of.Or.Opposite (chr), House.Number
##   (chr), Street.Name (chr), Intersecting.Street (chr), Date.First.Observed (int), Law.Section (int),
##   Sub.Division (chr), Violation.Legal.Code (chr), Days.Parking.In.Effect.... (chr), From.Hours.In.Effect
##   (chr), To.Hours.In.Effect (chr), Vehicle.Color (chr), Unregistered.Vehicle. (int), Vehicle.Year (int),
##   Meter.Number (chr), Feet.From.Curb (int), Violation.Post.Code (chr), Violation.Description (chr),
##   No.Standing.or.Stopping.Violation (int), Hydrant.Violation (int), Double.Parking.Violation (int)

Limitations

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        mutate(House.Number = str_trim(House.Number), Street.Name = str_trim(Street.Name)) %>%
        filter(House.Number != "" & Street.Name != "") %>%
        filter(str_detect(House.Number,"[0-9]+")) %>%
        transmute(Violation.Precinct = Violation.Precinct, addr = paste(House.Number, Street.Name)) %>%
        mutate(addr = tolower(addr)))
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## Error: RS-DBI driver: (error in statement: no such function: PASTE)

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        mutate(House.Number = trim(House.Number), Street.Name = trim(Street.Name)) %>%
        filter(House.Number != "" & Street.Name != "") 
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,283 x 5]
## Filter: Violation.Precinct <= 34, House.Number != "" & Street.Name != "" 
## 
##    Violation.Precinct House.Number   Street.Name House.Number.1 Street.Name.1
## 1                   0          840         8 AVE            840         8 AVE
## 2                  17          643 Lexington Ave            643 Lexington Ave
## 3                  17          305     E 47th St            305     E 47th St
## 4                  14            N     W 43rd St              N     W 43rd St
## 5                  24           39    W 105th St             39    W 105th St
## 6                  18          220     W 49th St            220     W 49th St
## 7                  18            6     W 48th St              6     W 48th St
## 8                  14          205     W 39th St            205     W 39th St
## 9                  10          255     W 14th St            255     W 14th St
## 10                 24            S     W 97th St              S     W 97th St
## ..                ...          ...           ...            ...           ...

(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
        filter(Violation.Precinct <= 34) %>%
        filter(trim(House.Number) != "" & trim(House.Number) != "") 
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,288 x 3]
## Filter: Violation.Precinct <= 34, trim(House.Number) != "" & trim(House.Number) != "" 
## 
##    Violation.Precinct House.Number   Street.Name
## 1                   0          840         8 AVE
## 2                  17          643 Lexington Ave
## 3                  17          305     E 47th St
## 4                  14            N     W 43rd St
## 5                  24           39    W 105th St
## 6                  18          220     W 49th St
## 7                  18            6     W 48th St
## 8                  14          205     W 39th St
## 9                  10          255     W 14th St
## 10                 24            S     W 97th St
## ..                ...          ...           ...

SQL Query

addr$query
## <Query> SELECT "Violation.Precinct" AS "Violation.Precinct", "House.Number" AS "House.Number", "Street.Name" AS "Street.Name"
## FROM "park"
## WHERE "Violation.Precinct" <= 34.0 AND TRIM("House.Number") != '' AND TRIM("House.Number") != ''
## <SQLiteConnection: DBI CON (28575, 0)>
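
Going the other direction, if you already know the SQL you want you can hand it to dplyr directly and keep working with the result as a lazy table. A minimal sketch, reusing the db source created above:

raw = tbl(db, sql('SELECT "Violation.Precinct", "House.Number", "Street.Name"
                   FROM park
                   WHERE "Violation.Precinct" <= 34'))
raw   # still a lazy tbl; further verbs and collect() work as usual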

SQL Translation

dplyr has a function, translate_sql, that lets you experiment with how R functions are translated to SQL.

translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x,y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## <SQL> AVG("x")
translate_sql(mean(x, na.rm=TRUE))
## Error: na.rm not needed in SQL: NULL are always dropped


In general, dplyr knows how to translate basic math, logical, and summary functions from R to SQL.
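
When a function has no SQL translation (like paste above), one workaround (a sketch, not the only option) is to push as much filtering as possible into the database, collect the result, and then apply the R-only functions locally:

# Filter in SQLite, then finish the string work in R after collect()
addr_local = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
             filter(Violation.Precinct <= 34) %>%
             collect() %>%
             mutate(House.Number = str_trim(House.Number),
                    Street.Name  = str_trim(Street.Name)) %>%
             filter(House.Number != "", Street.Name != "") %>%
             mutate(addr = tolower(paste(House.Number, Street.Name)))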

(Unfair) Timings

system.time(
  select(park, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(str_trim(House.Number) != "" & str_trim(House.Number) != "") 
)
##    user  system elapsed 
##   0.061   0.001   0.062
system.time(
  select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(trim(House.Number) != "" & trim(House.Number) != "") 
)
##    user  system elapsed 
##   0.018   0.000   0.019

park_sqlite was 3.4x faster than park, but the former is disk based while the latter is wholly in memory. Why this discrepancy?

Laziness

dplyr uses lazy evaluation as much as possible, particularly when working with SQL backends.

  • When building a query, we don’t want the entire table; often we just need enough rows to check that the query is working.

  • Since we would prefer to run one complex query over many simple queries, laziness allows verbs to be strung together.

  • Therefore, by default dplyr

    • won’t connect to and query the database until absolutely necessary (e.g. to show output),

    • will only query a handful of rows unless explicitly told otherwise (see the sketch below).
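
As a concrete sketch of what this means for the timings above: adding collect() to the SQLite pipeline forces the full query to actually run, which gives a fairer (and slower) number than the lazy version timed earlier.

system.time(
  select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
  filter(Violation.Precinct <= 34) %>%
  filter(trim(House.Number) != "" & trim(House.Number) != "") %>%
  collect()
)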

Full query

To force a full query and return a complete tbl_df object, dplyr uses the collect function.

collect(addr)
## Source: local data frame [35,288 x 3]
## 
##    Violation.Precinct House.Number   Street.Name
## 1                   0          840         8 AVE
## 2                  17          643 Lexington Ave
## 3                  17          305     E 47th St
## 4                  14            N     W 43rd St
## 5                  24           39    W 105th St
## 6                  18          220     W 49th St
## 7                  18            6     W 48th St
## 8                  14          205     W 39th St
## 9                  10          255     W 14th St
## 10                 24            S     W 97th St
## ..                ...          ...           ...

compute and collapse also force the query to be built (and, for compute, run) but have slightly different behavior and return types.
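
A minimal sketch of the differences (behavior may vary slightly across dplyr versions):

addr_local  = collect(addr)    # run the query, return a local tbl_df in R
addr_dbtbl  = compute(addr)    # run the query, store the result in a temporary
                               #   table in the database, return a tbl pointing at it
addr_nested = collapse(addr)   # don't hit the database; wrap the current query in a
                               #   subquery so later verbs build on top of it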

Creating Indexes

First some data loading and cleaning

library(data.table)
library(bit64)

park_full = fread("/home/vis/cr173/Sta523/data/parking/NYParkingViolations.csv") %>% 
            data.frame() %>%
            tbl_df() %>%
            select(Plate.ID:Violation.Code, Violation.Precinct, House.Number:Intersecting.Street) %>%
            mutate(Issue.Date = mdy(Issue.Date)) %>%
            mutate(Year = year(Issue.Date), Month = month(Issue.Date), Day = day(Issue.Date)) %>%
            select(-Issue.Date)

full = src_sqlite("~/db/park_full_db.sqlite3", create = TRUE) %>%
       copy_to(park_full, temporary = FALSE)

full_index = src_sqlite("~/db/park_full_indexed_db.sqlite3", create = TRUE) %>%
             copy_to(park_full, temporary = FALSE,
                     index = list(c("Year", "Month", "Day"), "Violation.Precinct", "Violation.Code"))

The indexed database takes up more disk space:

cr173@saxon [Sta523_Fa14]$ ls -lh ~/db
total 740M
-rw-r--r--+ 1 cr173 visitor    0 Oct 19 22:24 park_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 483M Oct 20 11:59 park_full_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 792M Oct 20 12:04 park_full_indexed_db.sqlite3
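
The extra space is the indexes themselves. To see what was actually created, one option (a sketch, going through DBI directly rather than dplyr) is to query SQLite's sqlite_master table:

library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbGetQuery(con, "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
dbDisconnect(con)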

Timings for filtering bad dates

system.time(full %>% filter( !(Year %in% 2013:2014) ) %>% collect)
##    user  system elapsed 
##   3.272   0.466   3.747
system.time(full_index %>% filter( !(Year %in% 2013:2014) ) %>% collect)
##    user  system elapsed 
##   3.261   0.499   3.765

Timings for grouping

system.time(full %>% filter( Year %in% 2013:2014 ) %>% 
            group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%  
            collect)
##    user  system elapsed 
##  41.238   2.097  43.347
system.time(full_index %>% filter( Year %in% 2013:2014 ) %>% 
            group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%  
            collect)
##    user  system elapsed 
##  40.488   2.175  42.659
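
The indexes don't appear to help for either query. One way to investigate (a sketch, again via DBI as in the earlier sketch, with the table name matching the copy_to call above) is to ask SQLite for its query plan and check whether an index is actually being used:

con = dbConnect(SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbGetQuery(con, '
  EXPLAIN QUERY PLAN
  SELECT "Year", "Month", "Violation.Precinct", COUNT(*) AS count
  FROM park_full
  WHERE "Year" IN (2013, 2014)
  GROUP BY "Year", "Month", "Violation.Precinct"')
dbDisconnect(con)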

Acknowledgments

Above materials are derived in part from the following sources: