--- title: databases & sql author: "Colin Rundel" date: "2018-11-12" output: xaringan::moon_reader: css: "slides.css" lib_dir: libs nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE} options( htmltools.dir.version = FALSE, # for blogdown width = 80, tibble.width = 80 ) knitr::opts_chunk$set( fig.align = "center" ) htmltools::tagList(rmarkdown::html_dependency_font_awesome()) library(dplyr) ``` --- class: middle count: false # The why of databases --- ## Numbers everyone should know | Task | Timing | |-------------------------------------|-------------------| | L1 cache reference | 0.5 ns | | L2 cache reference | 7 ns | | Main memory reference | 100 ns | | Read 1 MB sequentially from memory | 250,000 ns | | Disk seek | 10,000,000 ns | | Read 1 MB sequentially from network | 10,000,000 ns | | Read 1 MB sequentially from disk | 30,000,000 ns | .footnote[ From: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/ ] --- ## Implications for bigish data Lets imagine we have a *10 GB* flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set. If we can store the file in memory: * $10~GB \times (250~\mu s / 1~MB) = 0.25$ seconds If we have to access the file from disk: * $10~GB \times (30~ms / 1~MB) = 30$ seconds
This is just for *reading* data, if we make any modifications (*writing*) things are much worse. --- ## Blocks .center[ *Cost*: Disk << Memory ]
.center[ *Speed*: Disk <<< Memory ]
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can't fit everything into memory?
Create *blocks* - group related data (i.e. rows) and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk. --- ## Linear vs Binary Search Even with blocks, any kind of subsetting of rows requires a linear search, which requires $\mathcal{O}(N)$ accesses where $N$ is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns. * Sorting is expensive, $\mathcal{O}(N \log N)$, but it only needs to be done once. * After sorting, we can use a binary search for any subsetting tasks ( $\mathcal{O}(\log N)$ ). * These sorted columns are known as *indexes*. * Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk. --- ## Binary Search Example | Age | Name | |------|--------------| | 19 | Carol | | 20 | Greg | | 21 | Alice | | 21 | Dave | | 22 | Eve | | 23 | Bob | | 23 | Frank |

Lets search for records for people who are 22 or older. --- ## and then? This is just barely scratching the surface, * Efficiency gains are not just for disk, access is access * In general, trade off between storage and efficiency * Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools * Different tasks with different requirements require different implementations and have different criteria for optimization --- class: middle count: false # Databases --- ## SQL Structures Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures. * ANSI Standard but with some dialect divergence * This functionality maps very closely (but not exactly) with the data manipulation verbs present in dplyr. * We will see this mapping in more detail in a bit. --- ## DBI Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality: * connect/disconnect from DB * create and execute statements in the DB * extract results/output from statements * error/exception handling * information (meta-data) from database objects * transaction management (optional) --- ## RSQLite Provides the implementation necessary to use DBI to interface with an SQLite database. ```r library(RSQLite) --- ## Loading required package: DBI con = dbConnect(RSQLite::SQLite(), ":memory:") str(con) ## Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots ## ..@ Id : ## ..@ dbname : chr ":memory:" ## ..@ loadable.extensions: logi TRUE ## ..@ flags : int 6 ## ..@ vfs : chr "" ``` --- ## Example Table ```r employees = data.frame(name = c("Alice","Bob","Carol","Dave","Eve","Frank"), email = c("alice@company.com", "bob@company.com", "carol@company.com", "dave@company.com", "eve@company.com", "frank@comany.com"), salary = c(52000, 40000, 30000, 33000, 44000, 37000), dept = c("Accounting", "Accounting","Sales", "Accounting","Sales","Sales"), stringsAsFactors = FALSE) ``` ```r dbWriteTable(con, "employees", employees) ## [1] TRUE dbListTables(con) ## [1] "employees" ``` --- ## Removing Tables ```r dbWriteTable(con, "employs", employees) ## [1] TRUE dbListTables(con) ## [1] "employees" "employs" dbRemoveTable(con,"employs") ## [1] TRUE dbListTables(con) ## [1] "employees" ``` --- ## Querying Tables ```r res = dbSendQuery(con, "SELECT * FROM employees") dbFetch(res) ## name email salary dept ## 1 Alice alice@company.com 52000 Accounting ## 2 Bob bob@company.com 40000 Accounting ## 3 Carol carol@company.com 30000 Sales ## 4 Dave dave@company.com 33000 Accounting ## 5 Eve eve@company.com 44000 Sales ## 6 Frank frank@comany.com 37000 Sales dbClearResult(res) ## [1] TRUE ``` --- ## Closing the connection ```r dbDisconnect(con) ## [1] TRUE ``` --- class: middle count: false # SQL Queries --- ## Connecting ```shell cr173@saxon [2018-11-12-sqlite]$ sqlite3 employees.sqlite SQLite version 3.20.1 2017-08-24 16:21:36 Enter ".help" for usage hints. ``` --- ## Table information The following is specific to SQLite ```sqlite sqlite> .tables employees ``` ```sqlite sqlite> .schema employees CREATE TABLE `employees` ( `name` TEXT, `email` TEXT, `salary` REAL, `dept` TEXT ); ``` ```sqlite sqlite> .indices employees ``` --- ## SELECT Statements ```sqlite sqlite> SELECT * FROM employees; Alice|alice@company.com|52000.0|Accounting Bob|bob@company.com|40000.0|Accounting Carol|carol@company.com|30000.0|Sales Dave|dave@company.com|33000.0|Accounting Eve|eve@company.com|44000.0|Sales Frank|frank@comany.com|37000.0|Sales ``` --- ## Pretty Output We can make this table output a little nicer with some additonal SQLite options: ```sqlite sqlite> .mode column sqlite> .headers on ``` ```sqlite sqlite> SELECT * FROM employees; name email salary dept ---------- ----------------- ---------- ---------- Alice alice@company.com 52000.0 Accounting Bob bob@company.com 40000.0 Accounting Carol carol@company.com 30000.0 Sales Dave dave@company.com 33000.0 Accounting Eve eve@company.com 44000.0 Sales Frank frank@comany.com 37000.0 Sales ``` --- ## select using SELECT We can subset for certain columns (and rename them) using `SELECT` ```sqlite sqlite> SELECT name AS first_name, salary FROM employees; first_name salary ---------- ---------- Alice 52000.0 Bob 40000.0 Carol 30000.0 Dave 33000.0 Eve 44000.0 Frank 37000.0 ``` --- ## arrange using ORDER BY We can sort our results by adding `ORDER BY` to our `SELECT` statement ```sqlite sqlite> SELECT name AS first_name, salary FROM employees ORDER BY salary; first_name salary ---------- ---------- Carol 30000.0 Dave 33000.0 Frank 37000.0 Bob 40000.0 Eve 44000.0 Alice 52000.0 ``` --- We can sort in the opposite order by adding `DESC` ```sqlite SELECT name AS first_name, salary FROM employees ORDER BY salary DESC; first_name salary ---------- ---------- Alice 52000.0 Eve 44000.0 Bob 40000.0 Frank 37000.0 Dave 33000.0 Carol 30000.0 ``` --- ## filter via WHERE We can filter rows by adding `WHERE` to our statements ```sqlite sqlite> SELECT * FROM employees WHERE salary < 40000; name email salary dept ---------- ----------------- ---------- ---------- Carol carol@company.com 30000.0 Sales Dave dave@company.com 33000.0 Accounting Frank frank@comany.com 37000.0 Sales sqlite> SELECT * FROM employees WHERE salary < 40000 AND dept = "Sales"; name email salary dept ---------- ----------------- ---------- ---------- Carol carol@company.com 30000.0 Sales Frank frank@comany.com 37000.0 Sales ``` --- ## group_by via GROUP BY We can create groups for the purpose of summarizing using `GROUP BY`. As with dplyr it is not terribly useful by itself. ```sqlite sqlite> SELECT * FROM employees GROUP BY dept; name email salary dept ---------- ---------------- ---------- ---------- Dave dave@company.com 33000.0 Accounting Frank frank@comany.com 37000.0 Sales sqlite> SELECT dept, AVG(salary) AS mean_salary, COUNT(*) AS n FROM employees GROUP BY dept; dept mean_salary n ---------- ---------------- ---------- Accounting 41666.6666666667 3 Sales 37000.0 3 ``` --- ## head via LIMIT We can limit the number of rows we get by using `LIMIT`. ```sqlite sqlite> SELECT * FROM employees LIMIT 3; name email salary dept ---------- ----------------- ---------- ---------- Alice alice@company.com 52000.0 Accounting Bob bob@company.com 40000.0 Accounting Carol carol@company.com 30000.0 Sales sqlite> SELECT * FROM employees ORDER BY name DESC LIMIT 3; name email salary dept ---------- ---------------- ---------- ---------- Frank frank@comany.com 37000.0 Sales Eve eve@company.com 44000.0 Sales Dave dave@company.com 33000.0 Accounting ``` --- ## Import CSV files ```sqlite sqlite> .mode csv sqlite> .import phone.csv phone sqlite> .tables employees phone sqlite> .mode column sqlite> SELECT * FROM phone; name phone ---------- ------------ Bob 919 555-1111 Carol 919 555-2222 Eve 919 555-3333 Frank 919 555-4444 ``` --- ## SQL Joins .center[ ```{r echo=FALSE} knitr::include_graphics("imgs/sql_joins.png") ``` ] --- ## Joins - Default By default SQLite uses a `CROSS JOIN` which is not terribly useful ```sqlite sqlite> SELECT * FROM employees JOIN phone; name email salary dept name phone ---------- ----------------- ---------- ---------- ---------- ------------ Alice alice@company.com 52000.0 Accounting Bob 919 555-1111 Alice alice@company.com 52000.0 Accounting Carol 919 555-2222 Alice alice@company.com 52000.0 Accounting Eve 919 555-3333 Alice alice@company.com 52000.0 Accounting Frank 919 555-4444 Bob bob@company.com 40000.0 Accounting Bob 919 555-1111 Bob bob@company.com 40000.0 Accounting Carol 919 555-2222 Bob bob@company.com 40000.0 Accounting Eve 919 555-3333 Bob bob@company.com 40000.0 Accounting Frank 919 555-4444 Carol carol@company.com 30000.0 Sales Bob 919 555-1111 Carol carol@company.com 30000.0 Sales Carol 919 555-2222 Carol carol@company.com 30000.0 Sales Eve 919 555-3333 Carol carol@company.com 30000.0 Sales Frank 919 555-4444 Dave dave@company.com 33000.0 Accounting Bob 919 555-1111 Dave dave@company.com 33000.0 Accounting Carol 919 555-2222 Dave dave@company.com 33000.0 Accounting Eve 919 555-3333 Dave dave@company.com 33000.0 Accounting Frank 919 555-4444 Eve eve@company.com 44000.0 Sales Bob 919 555-1111 Eve eve@company.com 44000.0 Sales Carol 919 555-2222 Eve eve@company.com 44000.0 Sales Eve 919 555-3333 Eve eve@company.com 44000.0 Sales Frank 919 555-4444 Frank frank@comany.com 37000.0 Sales Bob 919 555-1111 Frank frank@comany.com 37000.0 Sales Carol 919 555-2222 Frank frank@comany.com 37000.0 Sales Eve 919 555-3333 Frank frank@comany.com 37000.0 Sales Frank 919 555-4444 ``` --- ## Inner Join If you want SQLite to find the columns to merge on automatically then we prefix the join with `NATURAL`. ```sqlite sqlite> SELECT * FROM employees NATURAL JOIN phone; name email salary dept phone ---------- --------------- ---------- ---------- ------------ Bob bob@company.com 40000.0 Accounting 919 555-1111 Carol carol@company.c 30000.0 Sales 919 555-2222 Eve eve@company.com 44000.0 Sales 919 555-3333 Frank frank@comany.co 37000.0 Sales 919 555-4444 ``` --- ## Inner Join - Explicit ```sqlite sqlite> SELECT * FROM employees JOIN phone ON employees.name = phone.name; name email salary dept name phone ---------- --------------- ---------- ---------- ---------- ------------ Bob bob@company.com 40000.0 Accounting Bob 919 555-1111 Carol carol@company.c 30000.0 Sales Carol 919 555-2222 Eve eve@company.com 44000.0 Sales Eve 919 555-3333 Frank frank@comany.co 37000.0 Sales Frank 919 555-4444 ``` --- ## Left Join - Natural ```sqlite sqlite> SELECT * FROM employees NATURAL LEFT JOIN phone; name email salary dept phone ---------- ----------------- ---------- ---------- ---------- Alice alice@company.com 52000.0 Accounting Bob bob@company.com 40000.0 Accounting 919 555-11 Carol carol@company.com 30000.0 Sales 919 555-22 Dave dave@company.com 33000.0 Accounting Eve eve@company.com 44000.0 Sales 919 555-33 Frank frank@comany.com 37000.0 Sales 919 555-44 ``` --- ## Left Join - Explicit ```sqlite sqlite> SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name; name email salary dept name phone ---------- ----------------- ---------- ---------- ---------- ---------- Alice alice@company.com 52000.0 Accounting Bob bob@company.com 40000.0 Accounting Bob 919 555-11 Carol carol@company.com 30000.0 Sales Carol 919 555-22 Dave dave@company.com 33000.0 Accounting Eve eve@company.com 44000.0 Sales Eve 919 555-33 Frank frank@comany.com 37000.0 Sales Frank 919 555-44 ``` --- ## Other Joins Note that SQLite does not support directly support an `OUTER JOIN` or a `RIGHT JOIN`. --- ## Creating an index ```sqlite sqlite> CREATE INDEX index_name ON employees (name); sqlite> .indices index_name sqlite> CREATE INDEX index_name_email ON employees (name,email); sqlite> .indices index_name index_name_email ``` --- ## Subqueries We can nest tables within tables for the purpose of queries. ```sqlite SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NULL; name email salary dept phone ---------- ----------------- ---------- ---------- ---------- Alice alice@company.com 52000.0 Accounting Dave dave@company.com 33000.0 Accounting ``` ```sqlite sqlite> SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NOT NULL; name email salary dept phone ---------- --------------- ---------- ---------- ------------ Bob bob@company.com 40000.0 Accounting 919 555-1111 Carol carol@company.c 30000.0 Sales 919 555-2222 Eve eve@company.com 44000.0 Sales 919 555-3333 Frank frank@comany.co 37000.0 Sales 919 555-4444 ``` --- ## Excercise - Stupid SQL Tricks Lets try to create a table that has a new column - `abv_avg` which contains how much more (or less) than the average, for their department, each person is paid. Hint - This will require joining a subquery. `employees.sqlite` is available in the exercises repo.