# Web Scraping
## Statistical Programming
### Adapted from STA 523, Professor Shawn Santo. Presented by Morris Greenberg
### 11-05-19

---

# Web Scraping

---

# Why Web Scraping?
- Webpages contain lots of information.
    - We can see statistics of the best players on ESPN leaderboards (tables of data)
    - We can see customer reviews for different products on Amazon (text data)
    - We can see the closest Starbucks locations on its store locator (geolocation data)

- **Web Scraping** allows us to systematically extract and save the data from web pages.
  
<center>
<img src="images/Starbucks.png" height="300" width="450">
</center>

---

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the 
  structure and contents and displays the results.
  
- The basic building blocks include elements, tags, and attributes.
    - An element is a component of an HTML document, such as a heading or a paragraph.
    - Elements are generally delimited by a start tag and an end tag.
    - Attributes provide additional information about an element and are written inside its start tag.

---

## Simple HTML document

```html
<!DOCTYPE html>
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

We can visualize this in a tree-like structure...

---

## HTML as a tree
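
The tree for the simple document above can also be printed from R. This is a minimal sketch, assuming the `xml2` package (which `rvest` builds on) is installed; `tree_doc` and `body_node` are names chosen for this illustration.

```r
library(xml2)

# Same simple document as in the HTML example above
tree_doc <- read_html(
  "<html>
    <head><title>Web Scraping</title></head>
    <body>
      <h1>Using rvest</h1>
      <p>To get started...</p>
    </body>
  </html>"
)

# Print the nesting of elements, one tag per line, indented by depth
xml_structure(tree_doc)

# Walk the tree by hand: list the element children of <body>
body_node <- xml_find_first(tree_doc, "//body")
xml_name(xml_children(body_node))
```

Each level of indentation in the printed structure corresponds to one level of nesting in the document.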

If we have access to an HTML document, then how can we easily 
extract information?

---

# `rvest`

---

## Package `rvest`

`rvest` is a package from Hadley Wickham that makes basic processing and 
manipulation of HTML data easy.

```r
library(rvest)
```

Core functions:

- `read_html()` - read HTML data from a url or character string

- `html_nodes()` - select specified nodes from the HTML document using CSS selectors

- `html_table()` - parse an HTML table into a data frame

- `html_text()` - extract tag pairs' content

- `html_name()` - extract tags' names

- `html_attrs()` - extract all of each tag's attributes

- `html_attr()` - extract tags' attribute value by name
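
As a rough illustration of how a few of these functions fit together, the sketch below parses a small table. The markup and the names `snippet`, `tbl_node`, and `swim_tbl` are made up for this example and are not taken from a real page.

```r
library(rvest)

# Hypothetical HTML fragment, invented for illustration
snippet <- '<table id="swims">
  <tr><th>Name</th><th>Distance</th></tr>
  <tr><td>Frogman</td><td>5 km</td></tr>
</table>'

# Select the <table> node from the parsed document
tbl_node <- read_html(snippet) %>% html_nodes("table")

html_name(tbl_node)        # tag name of the selected node
html_attrs(tbl_node)       # all attributes, as a list of named vectors
html_attr(tbl_node, "id")  # a single attribute, looked up by name

# html_table() on a node set returns a list with one table per node
swim_tbl <- html_table(tbl_node)[[1]]
swim_tbl
```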

---

## `html_document`

```r
simple_html <- 
"<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>
  
    <h1>Using rvest</h1>
    <p>To get started...</p>
  
  </body>
</html>"
html_doc <- read_html(simple_html)
attributes(html_doc)
```

```
#> $names
#> [1] "node" "doc" 
#> 
#> $class
#> [1] "xml_document" "xml_node"
```

---

```r
html_doc
```

```
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
#> [2] <body>\n  \n    <h1>Using rvest</h1>\n    <p>To get started...</p>\n ...
```

---

## CSS selectors

To extract components from an HTML document, use `html_nodes()` with CSS selectors.
In CSS, selectors are patterns used to select the elements you want to style.

We can determine the CSS selectors we need with the point-and-click
tool [SelectorGadget](https://selectorgadget.com/). More on this in a moment.

Selector          |  Example         | Description
:-----------------|:-----------------|:--------------------------------------------------
element           |  `p`             | Select all &lt;p&gt; elements
element element   |  `div p`         | Select all &lt;p&gt; elements inside a &lt;div&gt; element
element>element   |  `div > p`       | Select all &lt;p&gt; elements with &lt;div&gt; as a parent
.class            |  `.title`        | Select all elements with class="title"
#id               |  `#name`         | Select all elements with id="name"
[attribute]       |  `[class]`       | Select all elements with a class attribute
[attribute=value] |  `[class=title]` | Select all elements with class="title"


For more CSS selector references click [here](https://www.w3schools.com/cssref/css_selectors.asp).
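
The difference between the descendant selector `div p` and the child selector `div > p` is easy to miss. The sketch below tries both on a small fragment; the markup and object names are hypothetical and only meant to illustrate the selectors in the table.

```r
library(rvest)

# Hypothetical markup: one <p> directly inside the <div>, one nested deeper
nested_html <- '<div id="outer">
  <p class="title">Direct child</p>
  <section><p>Nested deeper</p></section>
</div>'

doc <- read_html(nested_html)

descendants <- html_nodes(doc, "div p")          # any <p> inside a <div>
children    <- html_nodes(doc, "div > p")        # only <p> whose parent is a <div>
by_class    <- html_nodes(doc, ".title")         # class selector
by_id       <- html_nodes(doc, "#outer")         # id selector
by_attr     <- html_nodes(doc, "[class=title]")  # attribute-value selector

length(descendants)  # 2
length(children)     # 1
html_text(by_class)  # "Direct child"
```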

???

- CSS stands for Cascading Style Sheets.
 
- CSS describes how HTML elements are to be displayed on screen, paper, or 
  in other media.
 
- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a `<style>` element in the `<head>` section
    - External - by using an external CSS file

---

## Examples

```r
html_swim <- 
'<html lang=en>
<head>
   <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
<td class=where>
   <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
   <span class=more>
   Gandy Beach, Gandy Blvd N St, Petersburg, FL
   </span>
</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>'
```

---

```html
<html lang=en>
<head>
   <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

*<p>This schedule lists every swim in the database. 396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>
```

---

To extract all `<p>` elements

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "p")
```

```
#> {xml_nodeset (1)}
#> [1] <p>This schedule lists every swim in the database. 396 events.</p>
```

To extract the contents between the tags

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "p") %>% 
  html_text()
```

```
#> [1] "This schedule lists every swim in the database. 396 events."
```

---

```html
<html lang=en>
<head>
   <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
*<td class=where>
*  <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
*  <span class=more>
*  Gandy Beach, Gandy Blvd N St, Petersburg, FL
*  </span>
*</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```

---

To select all elements with `class="where"`

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "[class=where]")
```

```
#> {xml_nodeset (1)}
#> [1] <td class="where">\n   <a class="mapq" href="http://www.google.com/m ...
```

To extract the text

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "[class=where]") %>% 
  html_text()
```

```
#> [1] "\n   Petersburg, FL\n   \n   Gandy Beach, Gandy Blvd N St, Petersburg, FL\n   \n"
```
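
The extracted text keeps the line breaks and indentation of the source HTML. One possible way to tidy it, sketched here on a copy of just the `where` cell (wrapped in a minimal table so it parses on its own), is to trim the ends with `html_text(trim = TRUE)` and then collapse interior whitespace with base R's `gsub()`. The names `where_html`, `where_text`, and `where_clean` are assumptions made for this example.

```r
library(rvest)

# Only the "where" cell from the schedule, wrapped in a minimal table
where_html <- '<table><tr>
<td class="where">
   <a class="mapq" href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
   <span class="more">
   Gandy Beach, Gandy Blvd N St, Petersburg, FL
   </span>
</td>
</tr></table>'

# trim = TRUE removes leading and trailing whitespace only
where_text <- where_html %>%
  read_html() %>%
  html_nodes(css = "[class=where]") %>%
  html_text(trim = TRUE)

# Collapse each remaining run of whitespace into a single space
where_clean <- gsub("\\s+", " ", where_text)
where_clean
```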

To extract the attributes

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "[class=where]") %>% 
  html_attrs()
```

```
#> [[1]]
#>   class 
#> "where"
```

---

```html
<html lang=en>
<head>
   <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
<td class=where> 
*  <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
   <span class=more> 
   Gandy Beach, Gandy Blvd N St, Petersburg, FL 
   </span> 
</td> 
*<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```

---

To extract the links (those with an `href` attribute)

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "[href]")
```

```
#> {xml_nodeset (2)}
#> [1] <a class="mapq" href="http://www.google.com/maps/?q=27.865501,-82.63 ...
#> [2] <a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a>
```

To get only the URLs

```r
html_swim %>% 
  read_html() %>% 
  html_nodes(css = "[href]") %>% 
  html_attr("href")
```

```
#> [1] "http://www.google.com/maps/?q=27.865501,-82.631997"
#> [2] "http://tampabayfrogman.com/"
```
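
In practice it is often useful to keep each link's visible text next to its URL. The following is a minimal sketch on a hypothetical fragment modeled on the two links above; `links_html`, `link_nodes`, and `link_df` are names introduced for this illustration.

```r
library(rvest)

# Hypothetical fragment containing the two links from the schedule row
links_html <- '<p>
  <a class="mapq" href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
  <a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a>
</p>'

# Select <a> elements that carry an href attribute
link_nodes <- links_html %>%
  read_html() %>%
  html_nodes(css = "a[href]")

# One row per link: the text shown on the page and the address it points to
link_df <- data.frame(
  label = html_text(link_nodes),
  url   = html_attr(link_nodes, "href"),
  stringsAsFactors = FALSE
)
link_df
```

Because `html_text()` and `html_attr()` each return one value per selected node, the two columns line up row by row.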

---

## SelectorGadget

[SelectorGadget](https://selectorgadget.com/) makes it easy to identify the CSS
selector you need by clicking on items on a webpage.

---

# Live demo

---

## Exercise

Scrape the Virginia Wegmans store names along
with each store's address and phone number (available at each store's link).
Build a data frame that looks similar to what you see below.

```
#> # A tibble: 12 x 5
#>    store     state full_address               phone   website              
#>    <chr>     <chr> <chr>                      <chr>   <chr>                
#>  1 Alexandr~ VA    7905 Hilltop Village Cent~ 571-52~ https://www.wegmans.~
#>  2 Chantilly VA    14361 Newbrook Drive Chan~ 571-52~ https://www.wegmans.~
#>  3 Charlott~ VA    100 Wegmans Way Charlotte~ (434) ~ https://www.wegmans.~
#>  4 Dulles    VA    45131 Columbia Place Ster~ 703-42~ https://www.wegmans.~
#>  5 Fairfax   VA    11620 Monument Drive Fair~ 703-65~ https://www.wegmans.~
#>  6 Frederic~ VA    2281 Carl D. Silver Parkw~ 540-32~ https://www.wegmans.~
#>  7 Lake Man~ VA    8297 Stonewall Shops Squa~ 571-22~ https://www.wegmans.~
#>  8 Leesburg  VA    101 Crosstrail Blvd SE Le~ 703-66~ https://www.wegmans.~
#>  9 Midlothi~ VA    12501 Stone Village Way M~ 804-41~ https://www.wegmans.~
#> 10 Potomac   VA    14801 Dining Way Woodbrid~ 703-76~ https://www.wegmans.~
#> 11 Short Pu~ VA    12200 Wegmans Blvd Henric~ 804-37~ https://www.wegmans.~
#> 12 Virginia~ VA    4721 Virginia Beach Blvd ~ 757-27~ https://www.wegmans.~
```

If you have time, try to clean up the data frame using regular
expressions and `stringr` functions from last class.
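
As a hint for the clean-up step, the phone column in the tibble above mixes two formats (for example `(434) ...` and `571-...`). The sketch below shows one possible way to normalize them with `stringr`. The phone numbers are made up, and `phone_raw`, `phone_digits`, and `phone_clean` are hypothetical names.

```r
library(stringr)

# Made-up phone numbers in the two formats seen in the tibble above
phone_raw <- c("(434) 555-0100", "571-555-0199")

# Drop every non-digit character
phone_digits <- str_replace_all(phone_raw, "\\D", "")

# Re-insert dashes so every number reads ###-###-####
phone_clean <- str_replace(phone_digits, "(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3")
phone_clean
```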

---

## References

- https://www.w3schools.com/

- https://selectorgadget.com/