class: center, middle, inverse, title-slide

# Web Scraping Part II
## Statistical Computing & Programming
### Shawn Santo
### 06-03-20

---

## Supplementary materials

Companion videos

- [Review of web scraping basics](https://warpwire.duke.edu/w/59ADAA/)
- [Book scrape code demo](https://warpwire.duke.edu/w/6dADAA/)
- [Web scraping best practices](https://warpwire.duke.edu/w/69ADAA/)
- [Beyond `rvest`](https://warpwire.duke.edu/w/7dADAA/)
- [Scraping dynamic websites code demo](https://warpwire.duke.edu/w/89ADAA/)

Additional resources

- [Web scraping cheat sheet](https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md)
- `RSelenium` [website](http://ropensci.github.io/RSelenium/)

---

class: inverse, center, middle

# Recall

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - elements are delimited by tags (a start tag and an end tag)
    - attributes provide additional information about HTML elements

<center>
<img src="images/html-structure.png" height="300" width="450">
</center>

---

## HTML vs. XML

.tiny.pull-left[
**HTML snippet**
```html
<tr>
  <td class=date>Jul 11, Sat</td>
  <td class=where>
    <a class=mapq href="http://www.google.com/maps/?q=43.639648,-71.779373">Bristol, NH</a>
    <span class=more>
      Wellington SP, West Shore Rd, Bristol, NH {43.639648,-71.779373}
      <a class="maplink" href="http://bing.com/maps/default.aspx?v=2&cp=43.63965~-71.77937&sp=point.43.63965_-71.77937_🏊&style=r">🄱</a>
      <a class="maplink" href="http://www.google.com/maps/?q=43.63965,-71.77937">🄶</a>
      <a class="maplink" href="https://www.mapquest.com/latlng/43.63965,-71.77937">🄼</a>
      <a class="maplink" href="http://www.openstreetmap.org/?=&mlat=43.63965&mlon=-71.77937">🄾</a>
    </span>
  </td>
  <td class=name><a href="http://www.swimnewfoundlake.com/">Swim with a Mission</a></td>
  <td class=distance>5 km, 10 km, 10 mi</td>
  <td class=more><span class=time>7:00 AM</span>, Newfound Lake. Ⓡ</td>
</tr>
```
]

.tiny.pull-right[
**XML snippet**
```xml
<swim>
  Swim with a Mission
  <location>Wellington SP, West Shore Rd, Bristol, NH</location>
  <link>http://www.swimnewfoundlake.com/</link>
  <date>Jul 11, Sat</date>
  <distance>5 km, 10 km, 10 mi</distance>
</swim>
```
]

---

## SelectorGadget

In CSS, selectors are patterns used to select the element(s) you want to style.

[SelectorGadget](https://selectorgadget.com/) makes identifying the CSS selector you need as easy as clicking on items on a webpage.

<center>
<iframe title="vimeo-player" src="https://player.vimeo.com/video/52055686" width="600" height="400" frameborder="0" allowfullscreen></iframe>
</center>

---

## Web scraping workflow

1. Understand the website's hierarchy and what information you need.

--

2. Use SelectorGadget to identify relevant CSS selectors.

--

3. Read the HTML by passing a URL, and subset the resulting HTML document using CSS selectors.

```r
read_html(url) %>%
  html_nodes(css = "specified_css_selector")
```

--

4. Further extract attributes, text, or tags by adding another layer with

```r
read_html(url) %>%
  html_nodes(css = "specified_css_selector") %>%
  html_*()
```

where `*` is `text`, `attr`, `attrs`, `name`, or `table`.

---

## Example with `html_table()`

http://www.tornadohistoryproject.com/tornado/North-Carolina/2017/table

.tiny[
```r
library(rvest)
library(tidyverse)

url <- "http://www.tornadohistoryproject.com/tornado/North-Carolina/2017/table"

nc_tornado <- read_html(url) %>%
  html_nodes("#results") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  select(date:lift_lon)

glimpse(nc_tornado)
```

```
#> Observations: 40
#> Variables: 15
#> $ date              <chr> "2017-02-15", "2017-03-31", "2017-05-01", "201…
#> $ time              <chr> "10:53:00 3", "16:15:00 3", "13:54:00 3", "01:…
#> $ state_s           <chr> "North Carolina", "North Carolina", "North Car…
#> $ fujita            <chr> "1", "1", "0", "0", "1", "0", "0", "0", "1", "…
#> $ fatalities        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "…
#> $ injuries          <chr> "0", "0", "0", "0", "0", "0", "0", "0", "1", "…
#> $ width             <chr> "60", "100", "50", "375", "250", "100", "75", …
#> $ length            <chr> "3.18", "4.8", "2.67", "3.26", "1", "11.71", "…
#> $ affected_counties <chr> "Brunswick", "Bertie", "Catawba", "Rockingham"…
#> $ damage            <chr> "$80000", "$250000", "$10000", "$40", "$100000…
#> $ crop_loss         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
#> $ touch_lat         <chr> "34.006", "36.2081", "35.61", "36.4869", "36.3…
#> $ touch_lon         <chr> "-78.6088", "-76.9334", "-81.2", "-79.7452", "…
#> $ lift_lat          <chr> "34.009", "36.2213", "35.64", "36.534", "36.37…
#> $ lift_lon          <chr> "-78.5534", "-76.8488", "-81.17", "-79.7494", …
```
]

---

## Overview

![](images/rvest_httr_fcns.png)

*Source*: https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md

---

## Recall previous exercise

Go to http://books.toscrape.com/catalogue/page-1.html and scrape the first five pages of data on books with regard to their

1. title
2. price
3. star rating

Organize your results in a neatly formatted tibble similar to the one below.

```r
# A tibble: 100 x 3
   title                                              price rating
   <chr>                                              <chr> <chr>
 1 A Light in the Attic                               £51.… Three
 2 Tipping the Velvet                                 £53.… One
 3 Soumission                                         £50.… One
 4 Sharp Objects                                      £47.… Four
 5 Sapiens: A Brief History of Humankind              £54.… Five
 6 The Requiem Red                                    £22.… One
 7 The Dirty Little Secrets of Getting Your Dream J…  £33.… Four
 8 The Coming Woman: A Novel Based on the Life of t…  £17.… Three
 9 The Boys in the Boat: Nine Americans and Their E…  £22.… Four
10 The Black Maria                                    £52.… One
# … with 90 more rows
```

---

class: inverse, center, middle

# Web scraping considerations

---

## Best practices

- Abide by a site's terms and conditions.

--

- Respect robots.txt.
    - https://www.facebook.com/robots.txt
    - https://www.wegmans.com/robots.txt
    - https://www.google.com/robots.txt

--

- Cache your `read_html()` chunks. Isolate these chunks from the rest of your code.

--

- Avoid using `read_html()` in code that is iterated.

--

- Do not overload the server at peak hours.
    - Implement delayed crawls: `Sys.sleep(rexp(1) + 4)`

--

- If available, use a site's API.

--

- Do not violate any copyright laws.

---

## Tips

- Follow the best practices on the previous slide.

<br><br><br>

--

- Disguise your IP address.
    - `httr::use_proxy()`

<br><br><br>

--

- Avoid scraping pages behind a log-in, unless it is permitted by the site.
    - `html_session()`

<br><br><br>

--

- Watch out for honey pot traps: links invisible to normal visitors, but present in the HTML code and found by web scrapers.

---

class: inverse, center, middle

# More on `rvest`

---

## Limitations of using `rvest` functions

- Difficult to make your code reproducible long term. When a website or its HTML changes, your code may no longer work. <br/><br/>
    - CSS selectors change
    - Contents are moved
    - Switch from HTML to JavaScript

<br><br>

--

- Websites that rely heavily on JavaScript

---

## What is JavaScript?
- Scripting language for building interactive web pages
- Basis for web, mobile, and network applications
- Every browser has a JavaScript engine that can execute JavaScript code.
    - Chrome: V8

<br>

--

If `read_html()` is meant for HTML, what can we do?

---

## Possible solutions

1. Execute JavaScript in R
2. Use Chrome's developer tools
3. Use package `RSelenium` or another web driver
    - http://ropensci.github.io/RSelenium/

<br><br>

We'll focus on the second option...

--

> In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere. <br><br> It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.<br><br> --- Hartley Brody

---

class: inverse, center, middle

# Live demo

---

## Exercise

Scrape all the QuikTrip stores within 25 miles of Tulsa, OK. Tidy the result into a data frame.

*Hint:* `html_children()`

???
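Before tidying the real response, the `html_children()` pattern can be tried on a toy snippet. A minimal sketch, using a made-up XML fragment in place of the live store-locator response; the `<poi>` field names (`name`, `address1`, `city`) are illustrative assumptions, not the endpoint's actual schema.

```r
library(rvest)      # html_nodes(), html_children(), html_name(), html_text()
library(tidyverse)  # map_dfr(), set_names(), as_tibble()

# Toy stand-in for the XML the store-locator endpoint returns; the real
# response has one <poi> node per store (field names here are made up).
toy_xml <- xml2::read_xml('
  <collection>
    <poi><name>Store A</name><address1>1 Main St</address1><city>Tulsa</city></poi>
    <poi><name>Store B</name><address1>2 Elm St</address1><city>Tulsa</city></poi>
  </collection>')

# Each <poi> becomes one row: child tag names supply the column names,
# child text supplies the values.
toy_stores <- toy_xml %>%
  html_nodes("poi") %>%
  map_dfr(function(poi) {
    kids <- html_children(poi)
    html_text(kids) %>%
      set_names(html_name(kids)) %>%
      as.list() %>%
      as_tibble()
  })

toy_stores
```

The same `map_dfr()` plus `html_children()` pattern should carry over to the `poi` nodes returned by the real endpoint.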
## Start to solutions

```r
# Query the store locator's AJAX endpoint directly; the response is XML
# with one <poi> node per store.
qt_xml <- read_html("https://hosted.where2getit.com/quiktrip/ajax?&xml_request=%3Crequest%3E%3Cappkey%3E82C11D38-0EC6-11E0-8AD9-4C59241F5146%3C%2Fappkey%3E%3Cformdata+id%3D%22locatorsearch%22%3E%3Cdataview%3Estore_default%3C%2Fdataview%3E%3Climit%3E200%3C%2Flimit%3E%3Cgeolocs%3E%3Cgeoloc%3E%3Caddressline%3ETulsa%2C+OK%3C%2Faddressline%3E%3Clongitude%3E%3C%2Flongitude%3E%3Clatitude%3E%3C%2Flatitude%3E%3C%2Fgeoloc%3E%3C%2Fgeolocs%3E%3Csearchradius%3E25%3C%2Fsearchradius%3E%3Cwhere%3E%3Ctravelcenter%3E%3Ceq%3E%3C%2Feq%3E%3C%2Ftravelcenter%3E%3Ctruckdiesel%3E%3Ceq%3E%3C%2Feq%3E%3C%2Ftruckdiesel%3E%3Ce15%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fe15%3E%3Cautodiesel%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fautodiesel%3E%3Cspecialtydrinks%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fspecialtydrinks%3E%3Cgen3%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fgen3%3E%3Chotandfreshpretzels%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fhotandfreshpretzels%3E%3Chotsandwiches%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fhotsandwiches%3E%3Cxlpizza%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fxlpizza%3E%3Cpersonalpizzas%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fpersonalpizzas%3E%3Cnoethanol%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fnoethanol%3E%3Ccertifiedscales%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fcertifiedscales%3E%3Cdef%3E%3Ceq%3E%3C%2Feq%3E%3C%2Fdef%3E%3Cfrozentreats%3E%3Ceq%3E%3C%2Feq%3E%3C%2Ffrozentreats%3E%3Cfreshbrewedtea%3E%3Ceq%3E%3C%2Feq%3E%3C%2Ffreshbrewedtea%3E%3Cfrozendrinks%3E%3Ceq%3E%3C%2Feq%3E%3C%2Ffrozendrinks%3E%3C%2Fwhere%3E%3C%2Fformdata%3E%3C%2Frequest%3E")

# One node per store
qt_stores <- qt_xml %>%
  html_nodes("poi")
```

---

## References

- https://selectorgadget.com/
- https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md