class: center, middle, inverse, title-slide

# Web APIs
## Statistical Computing & Programming
### Shawn Santo
### 06-04-20

---

## Supplementary materials

Companion videos

- [Getting data from dynamic websites](https://warpwire.duke.edu/w/b9EDAA/)
- [Introduction to APIs](https://warpwire.duke.edu/w/cdEDAA/)
- [HTTP](https://warpwire.duke.edu/w/c9EDAA/)
- [Working with web APIs](https://warpwire.duke.edu/w/d9EDAA/)

Additional resources

- [HTTP tutorial](https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177)
- `httr` [vignette](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html)
- [Public APIs](https://github.com/public-apis/public-apis)

---

class: inverse, center, middle

# Recall

---

## Limitations of using `rvest` functions

- Difficult to make your code reproducible long term. When a website or its HTML changes, your code may stop working. <br/><br/>
    - CSS selectors change
    - Contents are moved
    - Switch from HTML to JavaScript

<br><br>

--

- Websites that rely heavily on JavaScript

---

## Possible solutions

1. Execute JavaScript in R

2. Use Chrome's developer tools

3. Use package `RSelenium` or other web drivers
    - http://ropensci.github.io/RSelenium/

<br><br>

We'll focus on the second option...

--

>In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere. <br><br> It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.<br><br> --- Hartley Brody

---

class: inverse, center, middle

# Introduction

---

## Application Programming Interface

An API is a messenger that takes requests and returns responses. It allows for interaction between applications, databases, and devices.
--

<br/>

If you want to

- embed a map on your website, you'll probably use Google's API

- embed a tweet on your website, you'll probably use Twitter's API

- trade stocks in Python or R, you'll probably use your broker's API

- create 26 repositories named exam1-[github_name], you'll probably use GitHub's API

<br/>

[Thousands of APIs](https://www.programmableweb.com/apis/directory) exist. Most are integrated in a client-server framework.

---

## Old framework

Requests return HTML pages that are relatively easy to scrape.

![](images/internet_old.png)

*Source*: http://www.robert-drummond.com/2013/05/08/how-to-build-a-restful-web-api-on-a-raspberry-pi-in-javascript-2/

---

## Client-server framework with an API

The API facilitates communication between the web app and server/database.

![](images/internet_new.png)

*Source*: http://www.robert-drummond.com/2013/05/08/how-to-build-a-restful-web-api-on-a-raspberry-pi-in-javascript-2/

---

class: inverse, center, middle

# Protocols

---

## Protocols

A computer protocol is a set of rules that govern how multiple computers communicate.

- IP: Internet Protocol

- FTP: File Transfer Protocol

- HTTP: Hypertext Transfer Protocol
    - The key protocol that governs data transfer on the web
    - Allows HTML, CSS, JS to be transferred from a server to your browser

- HTTPS: Hypertext Transfer Protocol Secure

--

<br/><br/>

Why do we care? **Web APIs are built on HTTP**. Since so much of what we do is built over the web, it is natural for web APIs to follow this protocol.

---

## HTTP requests

.pull-left[

A client makes a request and includes

- a uniform resource locator (URL)
    - `http://www.mit.edu/`

- a method
    - GET, POST, PUT, DELETE, ...

- headers
    - meta-information about the request

- a body
    - possible data to send to the server

]

.pull-right[

<br/><br/><br/><br/>

![](images/request_return_cycle.jpeg)

<br/><br/>

*Source*: https://zapier.com/learn/apis/

]

<br/><br/>

.small-text[
<p align="right">Yes, MIT still uses http instead of https.</p>
]

---

## HTTP request, a closer look

URL and method

```http
Request URL: http://www.mit.edu/
Request Method: GET
```

--

Headers

```http
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: _ga=GA1.2.1783336314.1582479131; _gid=GA1.2.2130535740.1582479131; QSI_HistorySession=http%3A%2F%2Fweb.mit.edu%2F~1582479131191
Host: web.mit.edu
If-Modified-Since: Sun, 31 May 2020 05:00:23 GMT
If-None-Match: "10e8a05a-86ab-5e5206e7"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
```

---

## HTTP responses

.pull-left[

A server response includes

- a three-digit status code
    - 1xx indicates an informational message only
    - 2xx indicates success of some kind
    - 3xx redirects the client to another URL
    - 4xx indicates an error on the client's part
    - 5xx indicates an error on the server's part

- headers
    - meta-information about the response

- a body
    - data from the server

]

.pull-right[

<br/><br/><br/><br/>

![](images/request_return_cycle.jpeg)

<br/><br/>

*Source*: https://zapier.com/learn/apis/

]

---

## HTTP response, a closer look

Status code

```http
Status Code: 200 OK
```

--

Headers

```http
Accept-Ranges: bytes
Content-Encoding: gzip
Content-Length: 7663
Content-Type: text/html
Date: Sun, 31 May 2020 17:32:10 GMT
ETag: "10e8a05a-86ab-5e5206e7"
Last-Modified: Sun, 31 May 2020 05:00:23 GMT
Server: Apache
Vary: Accept-Encoding
X-Cnection: close
```

---

## Example with package `httr`

```r
library(httr)

resp <- GET("https://stat.duke.edu")
str(resp, max.level = 1)
```

```
#> List of 10
#>  $ url        : chr "https://stat.duke.edu"
#>  $ status_code: int 200
#>  $ headers    :List of 27
#>   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
#>  $ all_headers:List of 1
#>  $ cookies    :'data.frame': 1 obs. of 7 variables:
#>  $ content    : raw [1:84993] 3c 21 44 4f ...
#>  $ date       : POSIXct[1:1], format: "2020-06-03 14:46:50"
#>  $ times      : Named num [1:6] 0 0.0448 0.094 0.235 0.2679 ...
#>   ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
#>  $ request    :List of 7
#>   ..- attr(*, "class")= chr "request"
#>  $ handle     :Class 'curl_handle' <externalptr>
#>  - attr(*, "class")= chr "response"
```

---

```r
content(resp, "parsed")
```

```
#> {html_document}
#> <html class="no-js" lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
#> [2] <body class="html front not-logged-in">\n<div class="mm-page mm-slid ...
```

<br/><br/>

If you are unable to scrape data with `rvest`, package `httr` is a great alternative before using `RSelenium`.
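---

The pieces of a response described earlier (status code, headers, body) can also be pulled out one at a time. The following is a minimal sketch, assuming the `httr` accessors `status_code()`, `http_status()`, `http_type()`, and `headers()`; the helper name `summarise_response()` is hypothetical and exists only for illustration.

```r
library(httr)

# Hypothetical helper (not part of httr): collect the main parts of a
# response object into a small list.
summarise_response <- function(resp) {
  list(
    status    = status_code(resp),           # three-digit status code
    category  = http_status(resp)$category,  # broad class, e.g. "Success"
    type      = http_type(resp),             # media type from Content-Type
    n_headers = length(headers(resp))        # number of response headers
  )
}

# This part needs network access, so it is left commented out:
# resp <- GET("https://stat.duke.edu")
# summarise_response(resp)

# http_status() also accepts a bare status code, which works offline.
http_status(404)$category
```

Checking the status category before calling `content()` is one way to avoid parsing an error page as if it were data.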
---

class: inverse, center, middle

# More on web APIs

---

## RESTful APIs

**RE**presentational **S**tate **T**ransfer

- describes an architectural style for web services (not a standard)

- 6 guiding principles (constraints)

- all communication via HTTP requests

- a REST API should document what it provides and how to use it: query parameters, response format, request limits, public use/API keys, methods (GET/POST/PUT/DELETE), language support, etc.

---

## More on URLs

![](images/url_structure.png)

*Source*: [HTTP: The Protocol Every Web Developer Must Know](http://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177)

<br/>

- the same structure applies to `https` URLs
- default port is 80 for `http` and 443 for `https`, typically not displayed
- resource path is the local path to the resource on the server

Examples:

- `https://api.openbrewerydb.org/breweries`
- `https://api.openbrewerydb.org/breweries?by_state=new+york`

---

## Query strings

Provides named parameter(s) and value(s) that modify the behavior of the resulting page.

<br/>

Format generally follows:

<br/>

<center>
field1=value1&field2=value2&field3=value3
</center>

--

<br/>

Some quick examples:

* `https://api.petfinder.com/v2/animals?type=dog&page=2`

* `https://app.ticketmaster.com/discovery/v2/events.json?attractionId=K8vZ917Gku7&countryCode=CA&apikey=RpD2faqwk2uio290`

---

## URL encoding

This will often be handled automatically by your web browser or other tool, but it is useful to know a bit about what is happening.

- Spaces will be encoded as '+' or '%20'
    - `https://api.openbrewerydb.org/breweries?by_state=new+york`

- Certain characters are reserved and will be replaced with the percent-encoded version within a URL

.small[

| !   | #   | $   | &   | '   | (   | )   |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| %21 | %23 | %24 | %26 | %27 | %28 | %29 |
| *   | +   | ,   | /   | :   | ;   | =   |
| %2A | %2B | %2C | %2F | %3A | %3B | %3D |
| ?   | @   | [   | ]   |
| %3F | %40 | %5B | %5D |

]

- Non-ASCII characters are typically converted to their UTF-8 bytes and then percent-encoded (e.g. Σ becomes %CE%A3). In HTML form submissions, characters that cannot be represented in the form's character set may instead be sent as HTML numeric character references (e.g. Σ as &#931;)

---

.tiny[

```r
URLencode("https://api.openbrewerydb.org/breweries?by_state=new york")
```

```
#> [1] "https://api.openbrewerydb.org/breweries?by_state=new%20york"
```

```r
URLdecode("https://api.openbrewerydb.org/breweries?by_state=new%20york")
```

```
#> [1] "https://api.openbrewerydb.org/breweries?by_state=new york"
```

]

--

.tiny[

```r
URLencode("! # $ & ' ( ) * + , / : ; = ? @ [ ]", reserved = TRUE)
```

```
#> [1] "%21%20%23%20%24%20%26%20%27%20%28%20%29%20%2A%20%2B%20%2C%20%2F%20%3A%20%3B%20%3D%20%3F%20%40%20%5B%20%5D"
```

```r
URLdecode(URLencode("! # $ & ' ( ) * + , / : ; = ? @ [ ]", reserved = TRUE))
```

```
#> [1] "! # $ & ' ( ) * + , / : ; = ? @ [ ]"
```

]

--

.tiny[

```r
URLencode("Σ")
```

```
#> [1] "%CE%A3"
```

```r
URLdecode("%CE%A3")
```

```
#> [1] "Σ"
```

]

---

## More on methods

- GET - fetch a resource

- POST - create a new resource

- PUT - update a resource

- DELETE - delete a resource

<br/>

Less common verbs: HEAD, TRACE, OPTIONS

---

## JSON: JavaScript Object Notation

When exchanging data between a browser and a server, the data can only be text. JSON is the typical format and it is conveniently structured to be human and machine readable.

- R package `jsonlite` has some functions that will make it easy to get JSON data into a workable form in R.
    - `read_json()` - read in JSON data as a list
    - `fromJSON()` - read in JSON trying to simplify it to a data frame

<br/>

To preview JSON data in your browser, check out https://codebeautify.org/jsonviewer

---

## Exercise

Use the [Open Brewery API](https://www.openbrewerydb.org/) to answer the following questions.

1. How many breweries are located in Durham, NC?

2. Which city in North Carolina has the most micro breweries? How many micro breweries does it have?

3. In what cities are Founders, Yuengling, and Boulevard brewed?

???
## Solutions

```r
library(jsonlite)
library(tidyverse)

base_url <- "https://api.openbrewerydb.org/breweries"

# question 1
# read_json() returns a list with one element per brewery
query1 <- "?by_state=north+carolina&by_city=durham&per_page=50"
read_json(str_c(base_url, query1)) %>%
  length()
```

```
#> [1] 9
```

```r
# question 2
# fetch one page (up to 50 rows) of North Carolina micro breweries
get_nc_brew <- function(base, page) {
  query <- str_c("?by_state=north_carolina&by_type=micro&page=", page,
                 "&per_page=50")
  fromJSON(str_c(base, query))
}

# request pages 1 through 10 and stack the results
nc_micro <- map(1:10, get_nc_brew, base = base_url) %>%
  map_df(rbind) %>%
  as_tibble()

nc_micro %>%
  group_by(city) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1)
```

```
#> # A tibble: 1 x 2
#>   city    count
#>   <chr>   <int>
#> 1 Raleigh    17
```

```r
# question 3
brew <- c("founders", "yuengling", "boulevard")

# return the unique cities for breweries whose name matches `co`
get_city_brew <- function(co) {
  query3 <- str_c("?by_name=", co)
  fromJSON(str_c(base_url, query3)) %>%
    pull(city) %>%
    unique()
}

map(brew, get_city_brew) %>%
  set_names(brew)
```

```
#> $founders
#> [1] "Grand Rapids" "Detroit"
#>
#> $yuengling
#> [1] "Tampa"      "Pottsville"
#>
#> $boulevard
#> [1] "Kansas City"
```

---

## References

- https://httr.r-lib.org/index.html

- [An introduction to APIs](https://zapier.com/learn/apis/) by Brian Cooksey

- [HTTP: The Protocol Every Web Developer Must Know](http://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177)