The why of databases

Numbers everyone should know

Task	Timing
L1 cache reference	0.5 ns
L2 cache reference	7 ns
Main memory reference	100 ns
Read 1 MB sequentially from memory	250,000 ns
Disk seek	10,000,000 ns
Read 1 MB sequentially from network	10,000,000 ns
Read 1 MB sequentially from disk	30,000,000 ns

From: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/

Implications for bigish data

Lets imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:

\(10~GB \times (250~\mu s / 1~MB) = 0.25\) seconds

If we have to access the file from disk:

\(10~GB \times (30~ms / 1~MB) = 30\) seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

Linear vs Binary Search

Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(\mathcal{O}(N)\) accesses where \(N\) is the number of blocks.

We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.

Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks (\(\mathcal{O}(\log N)\)).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.

Binary Search Example

Age	Name
19	Carol
20	Greg
21	Alice
21	Dave
22	Eve
23	Bob
23	Frank

Lets search for records for people who are 22 or older.

and then?

This is just barely scratching the surface,

Efficiency gains are not just for disk, access is access
In general, trade off between storage and efficiency
Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools
Different tasks with different requirements require different implementations and have different criteria for optimization

SQL

Structures Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

ANSI Standard but with some dialect divergence
This functionality maps very closely (but not exactly) with the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.

Back to dplyr

data

library(dplyr)
library(readr)
library(lubridate)
library(stringr)

nyc = read_csv("/data/nyc_parking/NYParkingViolations.csv")

##  |================================================================================| 100% 1713 MB
##   
##  Warning: 654437 parsing failures.
##   row                  col   expected actual
##  2647 Violation Legal Code an integer      T
##  3792 Violation Legal Code an integer      T
##  4001 Violation Legal Code an integer      T
##  4002 Violation Legal Code an integer      T
##  4003 Violation Legal Code an integer      T
##  .... .................... .......... ......
##  .See problems(...) for more details.

nyc

##  Source: local data frame [9,100,278 x 43]
##  
##     Summons Number Plate ID Registration State Plate Type Issue Date Violation Code
##              (dbl)    (chr)              (chr)      (chr)      (chr)          (int)
##  1      1361929741  FCJ5493                 NY        PAS 12/18/1970             20
##  2      1366962000  63540MC                 NY        COM 02/02/1971             46
##  3      1356906515  GFM1421                 NY        PAS 09/18/1971             40
##  4      1342296217  FYM5117                 NY        SRF 09/18/1971             21
##  5      1342296199  95V6675                 TX        PAS 09/18/1971             21
##  6      1342296187  GCY4187                 NY        SRF 09/18/1971             21
##  7      1337077380  18972BB                 NY        999 10/10/1971             14
##  8      1364523796  WNJ4730                 VA        PAS 04/05/1973             14
##  9      1359914924  68091JZ                 NY        COM 07/22/1973             46
##  10     1355498326  EWV4127                 NY        PAS 08/12/1973             21
##  ..            ...      ...                ...        ...        ...            ...
##  Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr), Street
##    Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date (int), Violation
##    Location (chr), Violation Precinct (int), Issuer Precinct (int), Issuer Code (int), Issuer
##    Command (chr), Issuer Squad (chr), Violation Time (chr), Time First Observed (chr), Violation
##    County (chr), Violation In Front Of Or Opposite (chr), House Number (chr), Street Name (chr),
##    Intersecting Street (chr), Date First Observed (int), Law Section (int), Sub Division (chr),
##    Violation Legal Code (int), Days Parking In Effect (chr), From Hours In Effect (chr), To Hours
##    In Effect (chr), Vehicle Color (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter
##    Number (chr), Feet From Curb (int), Violation Post Code (chr), Violation Description (chr), No
##    Standing or Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)

Creating an sqlite database

(db = src_sqlite("/data/nyc_parking/NYParkingViolations.sqlite", create = TRUE))

##  src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  tbls:

nyc_sql = copy_to(db, nyc, temporary = FALSE)
db

##  src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  tbls: nyc, sqlite_stat1

nyc_sql = tbl(db,"nyc")
str(nyc_sql)

##  List of 9
##   $ src      :List of 3
##    ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
##    .. .. ..@ Id                 :<externalptr> 
##    .. .. ..@ dbname             : chr "/data/nyc_parking/NYParkingViolations.sqlite"
##    .. .. ..@ loadable.extensions: logi TRUE
##    .. .. ..@ flags              : int 6
##    .. .. ..@ vfs                : chr ""
##    ..$ path: chr "/data/nyc_parking/NYParkingViolations.sqlite"
##    ..$ info:List of 2
##    .. ..$ serverVersion: chr "3.8.6"
##    .. ..$ results      : logi FALSE
##    ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
##   $ from     :Classes 'ident', 'sql', 'character'  chr "nyc"
## ...

nyc_sql

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  From: nyc [9,100,278 x 43]
##  
##     Summons Number Plate ID Registration State Plate Type Issue Date Violation Code
##              (dbl)    (chr)              (chr)      (chr)      (chr)          (int)
##  1      1361929741  FCJ5493                 NY        PAS 12/18/1970             20
##  2      1366962000  63540MC                 NY        COM 02/02/1971             46
##  3      1356906515  GFM1421                 NY        PAS 09/18/1971             40
##  4      1342296217  FYM5117                 NY        SRF 09/18/1971             21
##  5      1342296199  95V6675                 TX        PAS 09/18/1971             21
##  6      1342296187  GCY4187                 NY        SRF 09/18/1971             21
##  7      1337077380  18972BB                 NY        999 10/10/1971             14
##  8      1364523796  WNJ4730                 VA        PAS 04/05/1973             14
##  9      1359914924  68091JZ                 NY        COM 07/22/1973             46
##  10     1355498326  EWV4127                 NY        PAS 08/12/1973             21
##  ..            ...      ...                ...        ...        ...            ...
##  Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr), Street
##    Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date (int), Violation
##    Location (chr), Violation Precinct (int), Issuer Precinct (int), Issuer Code (int), Issuer
##    Command (chr), Issuer Squad (chr), Violation Time (chr), Time First Observed (chr), Violation
##    County (chr), Violation In Front Of Or Opposite (chr), House Number (chr), Street Name (chr),
##    Intersecting Street (chr), Date First Observed (int), Law Section (int), Sub Division (chr),
##    Violation Legal Code (int), Days Parking In Effect (chr), From Hours In Effect (chr), To Hours
##    In Effect (chr), Vehicle Color (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter
##    Number (chr), Feet From Curb (int), Violation Post Code (chr), Violation Description (chr), No
##    Standing or Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)

Using dplyr with sqlite

(addr = nyc_sql %>%
        select(`Issue Date`, `Issuing Agency`, `Violation Precinct`, `House Number`, `Street Name`) %>%
        filter(`Violation Precinct` >=1, `Violation Precinct` <= 34)
)

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  From: nyc [3,564,798 x 5]
##  Filter: `Violation Precinct` >= 1, `Violation Precinct` <= 34 
##  
##     Issue Date Issuing Agency Violation Precinct House Number  Street Name
##          (chr)          (chr)              (int)        (chr)        (chr)
##  1  09/18/1971              X                 33         4165     BROADWAY
##  2  07/22/1973              P                 10           48        7 AVE
##  3  09/22/1973              P                 14          205      W 39 ST
##  4  10/30/1973              K                  1           NA SOUTH STREET
##  5  04/15/1977              X                 17          545        1 AVE
##  6  01/01/2000              P                 17          875      3RD AVE
##  7  01/06/2000              P                 20          210      W 64 ST
##  8  01/07/2000              P                 15         1375        6 AVE
##  9  01/07/2000              P                 15         1375        6 AVE
##  10 01/08/2000              P                 19          244      EAST 84
##  ..        ...            ...                ...          ...          ...

SQL Query

class(addr)

## [1] "tbl_sqlite" "tbl_sql"    "tbl"  

addr$query

##  <Query> SELECT "Issue Date" AS "Issue Date", "Issuing Agency" AS "Issuing Agency", "Violation Precinct" AS "Violation ##  Precinct", "House Number" AS "House Number", "Street Name" AS "Street Name"
##  FROM "nyc"
##  WHERE "Violation Precinct" >= 1.0 AND "Violation Precinct" <= 34.0
##  <SQLiteConnection>

Limitations

addr %>% mutate(address = paste(`House Number`, `Street Name`))

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  Error in sqliteSendQuery(con, statement, bind.data) : 
##    error in statement: no such function: PASTE

addr %>% summarize(mean = mean(`Violation Precinct`, na.rm=TRUE))

##  Error: na.rm not needed in SQL: NULL are always dropped

addr %>% summarize(mean = mean(`Violation Precinct`))

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  From: <derived table> [?? x 1]
##  
##        mean
##       (dbl)
##  1  16.0982
##  ..     ...

SQL Grouping

addr %>% group_by(`Issuing Agency`, `Violation Precinct`) %>% summarize(n=n())

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  From: <derived table> [?? x 3]
##  Grouped by: `Issuing Agency` 
##  
##     Issuing Agency Violation Precinct     n
##              (chr)              (int) (int)
##  1               A                  1    13
##  2               A                  7     1
##  3               A                 10    24
##  4               A                 11     1
##  5               A                 14    47
##  6               A                 33    11
##  7               B                 10     1
##  8               B                 25     2
##  9               C                  5    73
##  10              C                 13     7
##  ..            ...                ...   ...

SQL Query

addr %>% group_by(`Issuing Agency`, `Violation Precinct`) %>% summarize(n=n()) %>% .$query

##  <Query> SELECT "Issuing Agency", "Violation Precinct", "n"
##  FROM (SELECT "Issuing Agency", "Violation Precinct", COUNT() AS "n"
##  FROM "nyc"
##  WHERE "Violation Precinct" >= 1.0 AND "Violation Precinct" <= 34.0
##  GROUP BY "Issuing Agency", "Violation Precinct") AS "zzz4"
##  <SQLiteConnection>

SQL Translation

dplyr has a function, translate_sql, that lets you experiment with how R functions are translated to SQL

translate_sql(x == 1 & (y < 2 | z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

translate_sql(paste(x,y))

## <SQL> PASTE("x", "y")

translate_sql(mean(x))

## <SQL> avg("x") OVER ()

translate_sql(mean(x, na.rm=TRUE))

## Error in mean(x, na.rm = TRUE): unused argument (na.rm = TRUE)

In general, dplyr knows how to translate basic math, logical, and summary functions from R to SQL.

(Unfair) Timings

system.time(
    nyc %>%
    select(`Issue Date`, `Issuing Agency`, `Violation Precinct`, `House Number`, `Street Name`) %>%
    filter(`Violation Precinct` >=1, `Violation Precinct` <= 34) %>%
    group_by(`Issuing Agency`, `Violation Precinct`) %>%
    summarize(n=n())
)

##    user  system elapsed 
##   0.860   0.121   0.981 

system.time(
    nyc_sql %>%
    select(`Issue Date`, `Issuing Agency`, `Violation Precinct`, `House Number`, `Street Name`) %>%
    filter(`Violation Precinct` >=1, `Violation Precinct` <= 34) %>%
    group_by(`Issuing Agency`, `Violation Precinct`) %>%
    summarize(n=n())
)

##    user  system elapsed 
##   0.039   0.001   0.038

nyc_sql was 30x times faster than nyc, but the former is disk based while the latter is in memory, why this discrepancy?

Laziness

dplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
- won’t connect and query the database until absolutely necessary (e.g. show output),
- and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like

nyc_sql %>%
  select(`Issue Date`, `Issuing Agency`, `Violation Precinct`, `House Number`, `Street Name`) %>%
  filter(`Violation Precinct` >=1, `Violation Precinct` <= 34) %>%
  group_by(`Issuing Agency`, `Violation Precinct`) %>%
  summarize(n=n())

##  Source: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  From: <derived table> [?? x 3]
##  Grouped by: `Issuing Agency` 
##  
##     Issuing Agency Violation Precinct     n
##              (chr)              (int) (int)
##  1               A                  1    13
##  2               A                  7     1
##  3               A                 10    24
##  4               A                 11     1
##  5               A                 14    47
##  6               A                 33    11
##  7               B                 10     1
##  8               B                 25     2
##  9               C                  5    73
##  10              C                 13     7
##  ..            ...                ...   ...

Full query

To force a full query and return a complete tbl_df object dplyr uses the collect function.

nyc_sql %>%
  select(`Issue Date`, `Issuing Agency`, `Violation Precinct`, `House Number`, `Street Name`) %>%
  filter(`Violation Precinct` >=1, `Violation Precinct` <= 34) %>%
  group_by(`Issuing Agency`, `Violation Precinct`) %>%
  summarize(n=n()) %>%
  collect()

##  Source: local data frame [200 x 3]
##  Groups: Issuing Agency [15]
##  
##     Issuing Agency Violation Precinct     n
##              (chr)              (int) (int)
##  1               A                  1    13
##  2               A                  7     1
##  3               A                 10    24
##  4               A                 11     1
##  5               A                 14    47
##  6               A                 33    11
##  7               B                 10     1
##  8               B                 25     2
##  9               C                  5    73
##  10              C                 13     7
##  ..            ...                ...   ...

compute and collapse also force a full query but have slightly different behavior and return types.

Creating Indexes

(db_index = src_sqlite("/data/nyc_parking/NYParkingViolations_index.sqlite", create = TRUE))

## src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations_index.sqlite]
## tbls:

nyc_index = copy_to(db_index, nyc, temporary = FALSE,
                    index = list(`Violation Precinct`))

The indexed database takes up more disk space:

cr173@gort [~]$ ls -lh /data/nyc_parking/*.sqlite

-rw-r--r-- 1 cr173 visitor 1.8G Mar 30 11:56 NYParkingViolations_index.sqlite
-rw-r--r-- 1 cr173 visitor 1.7G Mar 30 11:28 NYParkingViolations.sqlite

Timings for filtering precincts

system.time(nyc_sql %>% filter(`Violation Precinct` <= 34, `Violation Precinct` >= 1) %>% collect())

##    user  system elapsed 
##  30.630   3.279  34.009 

system.time(nyc_index %>% filter(`Violation Precinct` <= 34, `Violation Precinct` >= 1) %>% collect())

##    user  system elapsed 
##  32.207   3.197  35.395

Timings for grouping

system.time(nyc_sql %>% group_by(`Violation Precinct`) %>% summarize(n=n()) %>% collect())

##     user  system elapsed 
##   21.184   2.829  24.007 

system.time(nyc_index %>% group_by(`Violation Precinct`) %>% summarize(n=n()) %>% collect())

##   user  system elapsed 
##  1.288   0.102   1.389

RSQLite

DBI

Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:

connect/disconnect from DBMS
create and execute statements in the DBMS
extract results/output from statements
error/exception handling
information (meta-data) from database objects
transaction management (optional)

RSQLite

Provides the implementation necessary to use DBI to interface with SQLite databases.

library(RSQLite)

## Loading required package: DBI

con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)

## Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
##   ..@ Id                 :<externalptr> 
##   ..@ dbname             : chr ":memory:"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 6
##   ..@ vfs                : chr ""

Example Table

employees = data.frame(name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
                       email  = c("alice@company.com", "bob@company.com",
                                  "carol@company.com", "dave@company.com",
                                  "eve@company.com",   "frank@comany.com"),
                       salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                       dept   = c("Accounting", "Accounting","Sales",
                                  "Accounting","Sales","Sales"),
                       stringsAsFactors = FALSE)

dbWriteTable(con, "employees", employees)

## [1] TRUE

dbListTables(con)

## [1] "employees"

Removing Tables

dbWriteTable(con, "employs", employees)

## [1] TRUE

dbListTables(con)

## [1] "employees" "employs"

dbRemoveTable(con,"employs")

## [1] TRUE

dbListTables(con)

## [1] "employees"

Querying Tables

res = dbSendQuery(con, "SELECT * FROM employees")
dbFetch(res)

##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales

dbClearResult(res)

## [1] TRUE

Closing the connection

dbDisconnect(con)

## [1] TRUE

SQL Queries

Table information

The following is specific to SQLite

sqlite> .tables

employees

sqlite> .schema employees

CREATE TABLE employees 
( "name" TEXT,
    "email" TEXT,
    "salary" REAL,
    "dept" TEXT 
);

sqlite> .indices employees

SELECT Statements

sqlite> SELECT * FROM employees;

Alice|alice@company.com|52000.0|Accounting
Bob|bob@company.com|40000.0|Accounting
Carol|carol@company.com|30000.0|Sales
Dave|dave@company.com|33000.0|Accounting
Eve|eve@company.com|44000.0|Sales
Frank|frank@comany.com|37000.0|Sales

Pretty Output

We can make this table output a little nicer with some additonal SQLite options:

sqlite> .mode column
sqlite> .headers on

sqlite> SELECT * FROM employees;

name        email              salary      dept      
----------  -----------------  ----------  ----------
Alice       alice@company.com  52000.0     Accounting
Bob         bob@company.com    40000.0     Accounting
Carol       carol@company.com  30000.0     Sales     
Dave        dave@company.com   33000.0     Accounting
Eve         eve@company.com    44000.0     Sales     
Frank       frank@comany.com   37000.0     Sales

select using SELECT

We can subset for certain columns (and rename them) using SELECT

sqlite> SELECT name AS first_name, salary FROM employees;

first_name  salary    
----------  ----------
Alice       52000.0   
Bob         40000.0   
Carol       30000.0   
Dave        33000.0   
Eve         44000.0   
Frank       37000.0

arrange using ORDER BY

We can sort our results by adding ORDER BY to our SELECT statement

sqlite> SELECT name AS first_name, salary FROM employees ORDER BY salary;

first_name  salary    
----------  ----------
Carol       30000.0   
Dave        33000.0   
Frank       37000.0   
Bob         40000.0   
Eve         44000.0   
Alice       52000.0

We can sort in the opposite order by adding DESC

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;

first_name  salary    
----------  ----------
Alice       52000.0   
Eve         44000.0   
Bob         40000.0   
Frank       37000.0   
Dave        33000.0   
Carol       30000.0

filter via WHERE

We can filter rows by adding WHERE to our statements

sqlite> SELECT * FROM employees WHERE salary < 40000;

name        email              salary      dept      
----------  -----------------  ----------  ----------
Carol       carol@company.com  30000.0     Sales     
Dave        dave@company.com   33000.0     Accounting
Frank       frank@comany.com   37000.0     Sales  

sqlite> SELECT * FROM employees WHERE salary < 40000 AND dept = "Sales";

name        email              salary      dept      
----------  -----------------  ----------  ----------
Carol       carol@company.com  30000.0     Sales     
Frank       frank@comany.com   37000.0     Sales

group_by via GROUP BY

We can create groups for the purpose of summarizing using GROUP BY. As with dplyr it is not terribly useful by itself.

sqlite> SELECT * FROM employees GROUP BY dept;

name        email             salary      dept      
----------  ----------------  ----------  ----------
Dave        dave@company.com  33000.0     Accounting
Frank       frank@comany.com  37000.0     Sales     

sqlite> SELECT dept, AVG(salary) AS mean_salary, COUNT(*) AS n FROM employees GROUP BY dept;

dept        mean_salary       n         
----------  ----------------  ----------
Accounting  41666.6666666667  3         
Sales       37000.0           3

head via LIMIT

We can limit the number of rows we get by using LIMIT.

sqlite> SELECT * FROM employees LIMIT 3;

name        email              salary      dept      
----------  -----------------  ----------  ----------
Alice       alice@company.com  52000.0     Accounting
Bob         bob@company.com    40000.0     Accounting
Carol       carol@company.com  30000.0     Sales 

sqlite> SELECT * FROM employees ORDER BY name DESC LIMIT 3;

name        email             salary      dept      
----------  ----------------  ----------  ----------
Frank       frank@comany.com  37000.0     Sales     
Eve         eve@company.com   44000.0     Sales     
Dave        dave@company.com  33000.0     Accounting

Import CSV files

sqlite> .mode csv
sqlite> .import phone.csv phone
sqlite> .tables

employees  phone   

sqlite> .mode column
sqlite> SELECT * FROM phone;

name        phone       
----------  ------------
Bob         919 555-1111
Carol       919 555-2222
Eve         919 555-3333
Frank       919 555-4444

Joins - Default

By default SQLite uses a CROSS JOIN which is not terribly useful

sqlite> SELECT * FROM employees JOIN phone;

name        email              salary      dept        name        phone       
----------  -----------------  ----------  ----------  ----------  ------------
Alice       alice@company.com  52000.0     Accounting  Bob         919 555-1111
Alice       alice@company.com  52000.0     Accounting  Carol       919 555-2222
Alice       alice@company.com  52000.0     Accounting  Eve         919 555-3333
Alice       alice@company.com  52000.0     Accounting  Frank       919 555-4444
Bob         bob@company.com    40000.0     Accounting  Bob         919 555-1111
Bob         bob@company.com    40000.0     Accounting  Carol       919 555-2222
Bob         bob@company.com    40000.0     Accounting  Eve         919 555-3333
Bob         bob@company.com    40000.0     Accounting  Frank       919 555-4444
Carol       carol@company.com  30000.0     Sales       Bob         919 555-1111
Carol       carol@company.com  30000.0     Sales       Carol       919 555-2222
Carol       carol@company.com  30000.0     Sales       Eve         919 555-3333
Carol       carol@company.com  30000.0     Sales       Frank       919 555-4444
Dave        dave@company.com   33000.0     Accounting  Bob         919 555-1111
Dave        dave@company.com   33000.0     Accounting  Carol       919 555-2222
Dave        dave@company.com   33000.0     Accounting  Eve         919 555-3333
Dave        dave@company.com   33000.0     Accounting  Frank       919 555-4444
Eve         eve@company.com    44000.0     Sales       Bob         919 555-1111
Eve         eve@company.com    44000.0     Sales       Carol       919 555-2222
Eve         eve@company.com    44000.0     Sales       Eve         919 555-3333
Eve         eve@company.com    44000.0     Sales       Frank       919 555-4444
Frank       frank@comany.com   37000.0     Sales       Bob         919 555-1111
Frank       frank@comany.com   37000.0     Sales       Carol       919 555-2222
Frank       frank@comany.com   37000.0     Sales       Eve         919 555-3333
Frank       frank@comany.com   37000.0     Sales       Frank       919 555-4444

Inner Join

If you want SQLite to find the columns to merge on automatically then we prefix the join with NATURAL.

sqlite> SELECT * FROM employees NATURAL JOIN phone;

name        email            salary      dept        phone       
----------  ---------------  ----------  ----------  ------------
Bob         bob@company.com  40000.0     Accounting  919 555-1111
Carol       carol@company.c  30000.0     Sales       919 555-2222
Eve         eve@company.com  44000.0     Sales       919 555-3333
Frank       frank@comany.co  37000.0     Sales       919 555-4444

Inner Join - Explicit

sqlite> SELECT * FROM employees JOIN phone ON employees.name = phone.name;

name        email            salary      dept        name        phone       
----------  ---------------  ----------  ----------  ----------  ------------
Bob         bob@company.com  40000.0     Accounting  Bob         919 555-1111
Carol       carol@company.c  30000.0     Sales       Carol       919 555-2222
Eve         eve@company.com  44000.0     Sales       Eve         919 555-3333
Frank       frank@comany.co  37000.0     Sales       Frank       919 555-4444

Left Join - Natural

sqlite> SELECT * FROM employees NATURAL LEFT JOIN phone;

name        email              salary      dept        phone     
----------  -----------------  ----------  ----------  ----------
Alice       alice@company.com  52000.0     Accounting            
Bob         bob@company.com    40000.0     Accounting  919 555-11
Carol       carol@company.com  30000.0     Sales       919 555-22
Dave        dave@company.com   33000.0     Accounting            
Eve         eve@company.com    44000.0     Sales       919 555-33
Frank       frank@comany.com   37000.0     Sales       919 555-44

Left Join - Explicit

sqlite> SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;

name        email              salary      dept        name        phone     
----------  -----------------  ----------  ----------  ----------  ----------
Alice       alice@company.com  52000.0     Accounting                        
Bob         bob@company.com    40000.0     Accounting  Bob         919 555-11
Carol       carol@company.com  30000.0     Sales       Carol       919 555-22
Dave        dave@company.com   33000.0     Accounting                        
Eve         eve@company.com    44000.0     Sales       Eve         919 555-33
Frank       frank@comany.com   37000.0     Sales       Frank       919 555-44

Other Joins

Currently SQLite does not support RIGHT JOIN or OUTER JOIN.

Creating an index

sqlite> CREATE INDEX index_name ON employees (name);
sqlite> .indices

index_name

sqlite> CREATE INDEX index_name_email ON employees (name,email);
sqlite> .indices

index_name
index_name_email

Subqueries

We can nest tables within tables for the purpose of queries.

SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NULL;

name        email              salary      dept        phone     
----------  -----------------  ----------  ----------  ----------
Alice       alice@company.com  52000.0     Accounting            
Dave        dave@company.com   33000.0     Accounting

sqlite> SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NOT NULL;

name        email            salary      dept        phone       
----------  ---------------  ----------  ----------  ------------
Bob         bob@company.com  40000.0     Accounting  919 555-1111
Carol       carol@company.c  30000.0     Sales       919 555-2222
Eve         eve@company.com  44000.0     Sales       919 555-3333
Frank       frank@comany.co  37000.0     Sales       919 555-4444

Excercise - Stupid SQL Tricks

Lets try to create a table that has a new column - abv_avg which contains how much more (or less) than the average, for their department, each person is paid.

Hint - This will require joining a subquery.

employees.sqlite is available in /data/Sta523 on gort.

Acknowledgments

Above materials are derived in part from the following sources:

dplyr and SQL

The why of databases

Numbers everyone should know

Implications for bigish data

Blocks

Linear vs Binary Search

Binary Search Example

and then?

SQL

Back to dplyr

data

Creating an sqlite database

Using dplyr with sqlite

SQL Query

Limitations

SQL Grouping

SQL Query

SQL Translation

(Unfair) Timings

Laziness

Full query

Creating Indexes

Timings for filtering precincts

Timings for grouping

RSQLite

DBI

RSQLite

Example Table

Removing Tables

Querying Tables

Closing the connection

SQL Queries

Table information

SELECT Statements

Pretty Output

select using SELECT

arrange using ORDER BY

filter via WHERE

group_by via GROUP BY

head via LIMIT

Import CSV files

Joins - Default

Inner Join

Inner Join - Explicit

Left Join - Natural

Left Join - Explicit

Other Joins

Creating an index

Subqueries

Excercise - Stupid SQL Tricks

Acknowledgments

Acknowledgments