class: center, middle, inverse, title-slide # Databases and SQL ## Statistical Computing & Programming ### Shawn Santo ### 06-11-20 --- ## Supplementary materials Companion videos - [Introduction to databases](https://warpwire.duke.edu/w/hdUDAA/) - [Translating dplyr to SQL](https://warpwire.duke.edu/w/h9UDAA/) - [SQL and R](https://warpwire.duke.edu/w/idUDAA/) Additional resources - Introduction to `dbplyr` [vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) --- ## Databases A **database** is a collection of data typically stored in a computer system. It is controlled by a **database management system (DBMS)**. There may be applications associated with them, such as an API. <br/> -- Types of DBMS: MySQL, Microsoft Access, Microsoft SQL Server, FileMaker Pro, Oracle Database, and dBASE. <br/> Types of databases: Relational, object-oriented, distributed, NoSQL, graph, and more. --- ## DBMS benefits - Lower storage and retrieval costs <br/> - Easy data access <br/> - Backup and recovery <br/> - Data consistency --- ## Relational database management system - A system that governs a relational database, where data is identified and accessed in relation to other data in the database. <br/><br/> - Relational databases generally organize data into tables comprised of fields and records. -- <br/><br/><br/><br/> - Many RDBMS use SQL to access data. --- ## SQL - SQL stands for Structured Query Language <br/><br/><br/> -- - It is an American National Standards Institute standard computer language for accessing and manipulating RDBMS. <br/><br/><br/> -- - There are different versions of SQL, but to be compliant with the American National Standards Institute the version must support the key query verbs. --- ## Big picture <center> <img src="images/sql-big-picture.gif"> </center> <br/> *Source*: https://www.w3resource.com/sql/tutorials.php --- class: inverse, center, middle # Translation to SQL --- ## `dbplyr` Package `dbplyr` allows you to query a database by automatically generating SQL queries. We'll use it as a starting point to see the connection between `dplyr` verbs (functions) and `SQL` verbs before we transition using SQL. <br/><br/> To get started, load the packages. ```r library(dbplyr) library(dplyr) ``` We'll create a temporary database called `airports_db` from `nycflights13::airports` --- ## Creating an in-memory database ```r con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:") ``` ```r copy_to(con, df = nycflights13::airports, "airports") DBI::dbListTables(con) ``` ``` #> [1] "airports" "sqlite_stat1" "sqlite_stat4" ``` ```r airports_db <- tbl(con, "airports") ``` --- ```r airports_db ``` ``` #> # Source: table<airports> [?? x 8] #> # Database: sqlite 3.29.0 [:memory:] #> faa name lat lon alt tz dst tzone #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> #> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_… #> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic… #> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic… #> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_… #> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_… #> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_… #> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_… #> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_… #> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_… #> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_… #> # … with more rows ``` -- <br/><br/><br/> What is different from our typical tibble object? --- ## Example NYC flights to airports by time zone. ```r airport_timezone <- airports_db %>% group_by(tzone) %>% summarise(count = n()) ``` ```r airport_timezone ``` ``` #> # Source: lazy query [?? x 2] #> # Database: sqlite 3.29.0 [:memory:] #> tzone count #> <chr> <int> #> 1 <NA> 3 #> 2 America/Anchorage 239 #> 3 America/Chicago 342 #> 4 America/Denver 119 #> 5 America/Los_Angeles 176 #> 6 America/New_York 519 #> 7 America/Phoenix 38 #> 8 America/Vancouver 2 #> 9 Asia/Chongqing 2 #> 10 Pacific/Honolulu 18 ``` --- ## Translation to SQL .pull-left[ ```r airport_timezone %>% show_query() ``` ``` #> <SQL> #> SELECT `tzone`, COUNT() AS `count` #> FROM `airports` #> GROUP BY `tzone` ``` ] .pull-right[ ```r airports_db %>% group_by(tzone) %>% summarise(count = n()) ``` ] <br/><br/><br/><br/><br/><br/> -- What are the `dplyr` translations to SQL? --- ## Exercise What are the corresponding SQL verbs based on the `dplyr` structure below? ```r airport_car <- airports_db %>% filter(lat >= 33.7666, lat <= 36.588, lon >= -84.3201, lon <= -75.4129) %>% arrange(desc(alt)) %>% select(name, alt) ``` ??? ## Solution .solution[ ```r airport_car %>% show_query() ``` ``` #> <SQL> #> SELECT `name`, `alt` #> FROM (SELECT * #> FROM (SELECT * #> FROM `airports` #> WHERE ((`lat` >= 33.7666) AND (`lat` <= 36.588) AND (`lon` >= -84.3201) AND (`lon` <= -75.4129))) #> ORDER BY `alt` DESC) ``` See later slides for full translation. ] --- ## Limitations ```r tail(airport_car) Error: tail() is not supported by sql sources ``` -- ```r airports_db %>% filter(lat >= 33.7666, lat <= 36.588, lon >= -84.3201, lon <= -75.4129) %>% arrange(desc(alt)) %>% select(name, alt) %>% slice(1:3) Error in UseMethod("slice_") : no applicable method for 'slice_' applied to an object of class "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')" ``` -- ```r airports_db %>% filter(lat >= 33.7666, lat <= 36.588, lon >= -84.3201, lon <= -75.4129) %>% select(name, alt) %>% filter(stringr::str_detect(name, pattern="Raleigh")) Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : object 'name' not found ``` --- ## Lazy remote queries ```r airport_car <- airports_db %>% filter(lat >= 33.7666, lat <= 36.588, lon >= -84.3201, lon <= -75.4129) %>% arrange(desc(alt)) %>% select(name, alt) %>% * collect() ``` - Data is never pulled into R unless you explicitly ask for it with `collect()` - Work is delayed until the moment it is required. Until I ask for `airport_car`, nothing communicates with the database. --- ## Close connection ```r DBI::dbDisconnect(con) ``` --- class: inverse, center, middle # SQL and R --- ## Create a database Set up a relational database management system and include some baseball data from package `Lahman`. ```r library(RSQLite) library(Lahman) ``` ```r con <- dbConnect(RSQLite::SQLite(), ":memory:") dbWriteTable(con, "batting", Batting) dbWriteTable(con, "pitching", Pitching) dbWriteTable(con, "teams", Teams) ``` --- ## Seeing tables and fields ```r dbListTables(con) ``` ``` #> [1] "batting" "pitching" "teams" ``` -- ```r dbListFields(con, "teams") %>% head() ``` ``` #> [1] "yearID" "lgID" "teamID" "franchID" "divID" "Rank" ``` -- ```r dbListFields(con, "pitching") ``` ``` #> [1] "playerID" "yearID" "stint" "teamID" "lgID" "W" #> [7] "L" "G" "GS" "CG" "SHO" "SV" #> [13] "IPouts" "H" "ER" "HR" "BB" "SO" #> [19] "BAOpp" "ERA" "IBB" "WP" "HBP" "BK" #> [25] "BFP" "GF" "R" "SH" "SF" "GIDP" ``` --- ## Common SQL query structure Main verbs to get data: ```sql SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC | DESC] LIMIT offset, count ``` `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT` are all optional. Primary computations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`. <br/><br/> We can perform these queries with `dbGetQuery()` and `paste()`. --- ## Verb connections | SQL | `dplyr` | |---------:|:----------------------------------------| | SELECT | `select()` | | table | data frame | | WHERE | `filter()` pre-aggregation/calculation | | GROUP_BY | `group_by()` | | HAVING | `filter()` post-aggregation/calculation | | ORDER BY | `arrange()` with possibly a `desc()` | | LIMIT | `slice()` | --- ## Examples Examine some attendance numbers ```r dbGetQuery(con, paste("SELECT yearID, franchID, attendance", "FROM teams", "LIMIT 5")) ``` ``` #> yearID franchID attendance #> 1 1871 BNA NA #> 2 1871 CNA NA #> 3 1871 CFC NA #> 4 1871 KEK NA #> 5 1871 NNA NA ``` -- ```r dbGetQuery(con, paste("SELECT yearID, franchID, attendance", "FROM teams", "WHERE yearID >= 2000", "LIMIT 5")) ``` ``` #> yearID franchID attendance #> 1 2000 ANA 2066982 #> 2 2000 ARI 2942251 #> 3 2000 ATL 3234304 #> 4 2000 BAL 3297031 #> 5 2000 BOS 2585895 ``` --- What happens if we change the order or query structure? -- <br/><br/> ```r *dbGetQuery(con, paste("FROM teams", "SELECT yearID, franchID, attendance", "WHERE yearID >= 2000", "LIMIT 5")) ``` ``` #> Error: near "FROM": syntax error ``` --- Get the average attendance for each franchise since 2010 and show the top 10. -- ```r dbGetQuery(con, paste("SELECT yearID, franchID, attendance, AVG(attendance)", "FROM teams", "WHERE yearID >= 2010", "ORDER BY AVG(attendance) DESC", "LIMIT 10")) ``` ``` #> yearID franchID attendance AVG(attendance) #> 1 2010 ARI 2056941 2438242 ``` -- <br/><br/><br/> What is wrong? --- Get the average attendance for each franchise since 2010 and show the top 10. ```r dbGetQuery(con, paste("SELECT yearID, franchID, attendance, AVG(attendance)", "FROM teams", "WHERE yearID >= 2010", * "GROUP BY franchID", "ORDER BY AVG(attendance) DESC", "LIMIT 10")) ``` ``` #> yearID franchID attendance AVG(attendance) #> 1 2010 LAD 3562320 3604339 #> 2 2010 NYY 3765807 3392236 #> 3 2010 STL 3301218 3376067 #> 4 2010 SFG 3037443 3299843 #> 5 2010 ANA 3250816 3073673 #> 6 2010 CHC 3062973 2976743 #> 7 2010 BOS 3046443 2953583 #> 8 2010 COL 2875245 2774276 #> 9 2010 TEX 2505171 2736283 #> 10 2010 MIL 2776531 2704837 ``` -- <br/><br/><br/> Do we need `yearID` and `attendance` is our `SELECT` line? --- Get the average attendance for each franchise since 2010 and show the top 10. ```r *dbGetQuery(con, paste("SELECT franchID, AVG(attendance)", "FROM teams", "WHERE yearID >= 2010", "GROUP BY franchID", "ORDER BY AVG(attendance) DESC", "LIMIT 10")) ``` ``` #> franchID AVG(attendance) #> 1 LAD 3604339 #> 2 NYY 3392236 #> 3 STL 3376067 #> 4 SFG 3299843 #> 5 ANA 3073673 #> 6 CHC 2976743 #> 7 BOS 2953583 #> 8 COL 2774276 #> 9 TEX 2736283 #> 10 MIL 2704837 ``` --- Which players had at least 300 strikeouts (SO) in a season between 1960 and 1990? -- ```r *dbGetQuery(con, paste("SELECT playerID, yearID, MAX(SO) as maxK", "FROM pitching", "WHERE yearID >= 1960 AND yearID <= 1990", "GROUP BY playerID, yearID", "HAVING maxK > 300", "ORDER BY maxK DESC")) ``` ``` #> playerID yearID maxK #> 1 ryanno01 1973 383 #> 2 koufasa01 1965 382 #> 3 ryanno01 1974 367 #> 4 ryanno01 1977 341 #> 5 ryanno01 1972 329 #> 6 ryanno01 1976 327 #> 7 mcdowsa01 1965 325 #> 8 koufasa01 1966 317 #> 9 richajr01 1979 313 #> 10 carltst01 1972 310 #> 11 lolicmi01 1971 308 #> 12 koufasa01 1963 306 #> 13 scottmi03 1986 306 #> 14 mcdowsa01 1970 304 #> 15 richajr01 1978 303 #> 16 bluevi01 1971 301 #> 17 ryanno01 1989 301 ``` **Can we restructure the query?** --- Which players had at least 300 strikeouts (SO) in a season between 1960 and 1990? ```r dbGetQuery(con, paste("SELECT playerID, yearID, MAX(SO) as maxK", "FROM pitching", * "HAVING maxK > 300", "GROUP BY playerID, yearID", * "WHERE yearID >= 1960 AND yearID <= 1990", "ORDER BY maxK DESC")) ``` ``` #> Error: near "GROUP": syntax error ``` -- <br/><br/> ```r dbGetQuery(con, paste("SELECT yearID, franchID, attendance", "FROM teams", * "HAVING yearID >= 2000", "LIMIT 5")) ``` ``` #> Error: a GROUP BY clause is required before HAVING ``` --- ## SQL arithmetic and comparison operators SQL supports the standard `+`, `-`, `*`, `/`, and `%` (modulo) arithmetic operators and the following comparison operators. <br/> | Operator | Description | |:--------:|:-------------------------| | `=` | Equal to | | `>` | Greater than | | `<` | Less than | | `>=` | Greater than or equal to | | `<=` | Less than or equal to | | `<>` | Not equal to | --- ## SQL logical operators | Operator | Description | |----------:|:-------------------------------------------------------------| | `ALL` | TRUE if all of the subquery values meet the condition | | `AND` | TRUE if all the conditions separated by AND is TRUE | | `ANY` | TRUE if any of the subquery values meet the condition | | `BETWEEN` | TRUE if the operand is within the range of comparisons | | `EXISTS` | TRUE if the subquery returns one or more records | | `IN` | TRUE if the operand is equal to one of a list of expressions | | `LIKE` | TRUE if the operand matches a pattern | | `NOT` | Displays a record if the condition(s) is NOT TRUE | | `OR` | TRUE if any of the conditions separated by OR is TRUE | | `SOME` | TRUE if any of the subquery values meet the condition | --- ## Exercises 1. Add `Salaries` from package `Lahman` as a table to your in-memory database. 2. Compute the team salaries for each team in 2016 and display the 5 teams with the highest payroll. Which team had the lowest payroll in that year? 3. Which 10 teams had the highest winning percentage since 1990? *Hint*: https://www.w3schools.com/sql/func_sqlserver_cast.asp 4. How would you combine the batting and salaries tables to match up the players and years? Take a look at `?dplyr::join`. Try to combine the R data frame objects `Batting` and `Salaries`. ??? ## Solutions .solution[ ```r dbWriteTable(con, "salaries", Salaries) dbGetQuery(con, paste("SELECT teamID, SUM(salary) as payroll", "FROM salaries", "WHERE yearID = 2016", "GROUP BY teamID", "ORDER BY payroll DESC", "LIMIT 5")) dbGetQuery(con, paste("SELECT yearID, teamID, W, L, CAST(W as FLOAT) / G as wpct", "FROM teams", "WHERE yearID >= 1990", "GROUP BY yearID, teamID", "ORDER BY wpct DESC", "LIMIT 10")) left_join(Batting, Salaries, by = c("playerID", "yearID")) %>% glimpse() ``` ] --- ## References - https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html - https://www.w3resource.com/sql/tutorials.php