# Web Scraping
### Yue Jiang

---

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the 
  structure and contents and displays the results.
  
- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document, such as a heading or a paragraph
    - most elements are delimited by a start tag and an end tag
    - attributes provide additional information about an element and are set in its start tag

---

## Simple HTML document

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

We can visualize this in a tree-like structure.

---

## HTML tree-like structure

If we have access to an HTML document, then how can we easily 
extract information?

---

# Package `rvest`

---

## Package `rvest`

`rvest` is a package authored by Hadley Wickham that makes basic processing and 
manipulation of HTML documents easy.

```r
library(tidyverse)
library(rvest)
```

Core functions:

.small-text[
| Function            | Description                                                       |
|---------------------|-------------------------------------------------------------------|
| `xml2::read_html()` | read HTML from a character string or URL                          |
| `html_nodes()`      | select specified pieces from the HTML document using CSS selectors|
| `html_table()`      | parse an HTML table into a data frame                             |
| `html_text()`       | extract text content                                              |
| `html_name()`       | extract tag names                                                 |
| `html_attrs()`      | extract all attributes and values                                 |
| `html_attr()`       | extract the value of a specified attribute                        |
]
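
The attribute and table helpers are not demonstrated on the slides that follow, so here is a minimal sketch of them. The HTML string, link, and table contents are made up for illustration and do not come from a real site.

```r
library(rvest)

# Hypothetical HTML fragment, made up for illustration
toy_html <- read_html('
<body>
  <a href="https://example.com" class="ext">Example</a>
  <table>
    <tr><th>name</th><th>km</th></tr>
    <tr><td>Bay swim</td><td>5</td></tr>
  </table>
</body>')

link <- html_nodes(toy_html, css = "a")

link_name  <- html_name(link)                 # tag name of each node
link_attrs <- html_attrs(link)                # list: one named character vector per node
link_href  <- html_attr(link, name = "href")  # value of a single attribute

# html_table() on a set of nodes returns a list with one data frame per table
swim_table <- toy_html %>% 
  html_nodes(css = "table") %>% 
  html_table() %>% 
  .[[1]]
```

Note that `html_attrs()` returns a list while `html_attr()` returns a character vector, which is usually the more convenient form for building a tibble.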

---

## HTML in R

We'll create a simple HTML document as a string to demonstrate some of 
these functions.

```r
simple_html <- "<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>"
```

Preview our character object:

```r
simple_html
```

```
#> [1] "<html>\n<head>\n<title>Web Scraping</title>\n</head>\n<body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>\n</html>"
```

---

## HTML in R

Read in the document with `read_html()`.

```r
html_simple <- read_html(simple_html)
```

<br/>

What does this look like?

```r
html_simple
```

```
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>
```

---

## Subset with `html_nodes()`

Let's extract the highlighted component below.

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

*<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

```r
h1_nodes <- html_nodes(html_simple, css = "h1")
h1_nodes
```

```
#> {xml_nodeset (1)}
#> [1] <h1>Using rvest</h1>
```

---

## Extract contents and tag name

Let's extract "Using rvest" and `h1`.

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

*<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

```r
h1_nodes %>% 
  html_text()
```

```
#> [1] "Using rvest"
```

```r
h1_nodes %>% 
  html_name()
```

```
#> [1] "h1"
```

---

## Scaling up

Most HTML documents are not as simple as what we just examined. There may be
tables, hundreds of links, paragraphs of text, and more. Naturally, we may
wonder:

<br/>

1. How do we handle larger HTML documents? (see next slide)

2. How do we know what to provide to `css` in function `html_nodes()` when
   we attempt to subset the HTML document?
   
3. Are these functions in `rvest` vectorized? For instance, are we able to get 
   all the content in the `td` tags on the slide that follows?

<br/>

In Chrome, you can view the HTML document associated with a web page by going
to `View > Developer > View Source`.

---

.tiny[
```html
<html lang=en>
<head>
   <title>Rays Notebook: Open Water Swims 2020 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 383 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 12, Sun</td>
<td class=where>
   <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
   <span class=more>
   Gandy Beach, Gandy Blvd N St, Petersburg, FL
   </span>
</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</tbody>
</table>
</main>
</body>
</html>
```
]

This is a snippet from the HTML document associated with the website
[here](https://raysnotebook.info/ows/schedules/The%20Whole%20Shebang.html).
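
To check whether the `rvest` extractors are vectorized, here is a minimal sketch that parses a hand-trimmed copy of the table row above from a string. The trimming is ours, so treat the string as a stand-in for the live page rather than an exact copy.

```r
library(rvest)

# Hand-trimmed version of the table row shown above
swim_html <- read_html('
<table class="schedule">
<thead><tr><th>Date</th><th>Name</th><th>Distance</th></tr></thead>
<tbody>
<tr id="January">
<td class="date">Jan 12, Sun</td>
<td class="name"><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class="distance">5 km</td>
</tr>
</tbody>
</table>')

# html_nodes() returns every match; html_text() returns one string per node
td_text <- swim_html %>% 
  html_nodes(css = "td") %>% 
  html_text()

# The same holds for attributes: one value per matched node
td_class <- swim_html %>% 
  html_nodes(css = "td") %>% 
  html_attr(name = "class")
```

Both results have length three, one entry per `td` element, so no explicit loop is needed.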

---

# CSS and SelectorGadget

---

## CSS selectors
.tiny-text[
To extract components from HTML documents, use `html_nodes()` with CSS selectors.
In CSS, selectors are patterns used to select the elements you want to style.

We can determine the CSS selectors we need with the point-and-click
tool [SelectorGadget](https://selectorgadget.com/). More on this in a moment.]

---

## CSS selectors

.small-text[
| Selector          | Example         | `html_nodes()` call                  | Selects all                                     |
|-------------------|-----------------|--------------------------------------|-------------------------------------------------|
| element           | `p`             | `html_nodes(x, css = "p")`           | &lt;p&gt; elements                              |
| element element   | `div p`         | `html_nodes(x, css = "div p")`       | &lt;p&gt; elements inside a &lt;div&gt; element |
| .class            | `.title`        | `html_nodes(x, css = ".title")`      | elements with class="title"                     |
| #id               | `#name`         | `html_nodes(x, css = "#name")`       | elements with id="name"                         |
| [attribute]       | `[class]`       | `html_nodes(x, css = "[class]")`     | elements with a class attribute                 |
| [attribute=value] | `[href='www']`  | `html_nodes(x, css = "[href='www']")`| elements with href="www"                        |
]

For more CSS selector references click [here](https://www.w3schools.com/cssref/css_selectors.asp).
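
To see these selectors in action, here is a minimal sketch that counts matches for each one. The page is a made-up string built so that every selector type has something to match, and `n_match()` is a hypothetical helper defined only for this example.

```r
library(rvest)

# Hypothetical page, made up so that each selector type has something to match
css_html <- read_html('
<body>
  <p class="title">Top-level paragraph</p>
  <div>
    <p id="name">Paragraph inside a div</p>
    <a href="www">A link</a>
  </div>
</body>')

# Hypothetical helper: number of nodes matched by a CSS selector
n_match <- function(selector) {
  length(html_nodes(css_html, css = selector))
}

n_match("p")             # element: both <p> elements
n_match("div p")         # element element: only the <p> inside the <div>
n_match(".title")        # .class
n_match("#name")         # #id
n_match("[class]")       # [attribute]
n_match("[href='www']")  # [attribute=value]
```

Only the first call matches two nodes; every other selector picks out exactly one node in this toy page.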

???

- CSS stands for Cascading Style Sheets.
 
- CSS describes how HTML elements are to be displayed on screen, paper, or 
  in other media.
 
- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a `<style>` element in the `<head>` section
    - External - by using an external CSS file

---

## SelectorGadget

[SelectorGadget](https://selectorgadget.com/) makes it easy to identify the CSS 
selector you need by clicking on items on a webpage.

---

# Live demo

---

Let's go to http://books.toscrape.com/catalogue/page-1.html and scrape the first 
five pages of book data, collecting each book's

1. title
2. price
3. star rating

We'll organize our results in a neatly formatted tibble similar to the one below.

```r
# A tibble: 100 x 3
   title                                             price rating
   <chr>                                             <chr> <chr> 
 1 A Light in the Attic                              £51.… Three 
 2 Tipping the Velvet                                £53.… One   
 3 Soumission                                        £50.… One   
 4 Sharp Objects                                     £47.… Four  
 5 Sapiens: A Brief History of Humankind             £54.… Five  
 6 The Requiem Red                                   £22.… One   
 7 The Dirty Little Secrets of Getting Your Dream J… £33.… Four  
 8 The Coming Woman: A Novel Based on the Life of t… £17.… Three 
 9 The Boys in the Boat: Nine Americans and Their E… £22.… Four  
10 The Black Maria                                   £52.… One   
# … with 90 more rows
```

**Code is given in the presentation notes. Hit `P`.**

???

## Solution

```r
# example for page 1, see how everything works
url <- "http://books.toscrape.com/catalogue/page-1.html"

read_html(url) %>% 
  html_nodes(css = ".price_color") %>% 
  html_text()

read_html(url) %>% 
  html_nodes(css = ".product_pod a") %>% 
  html_attr("title") %>% 
  .[!is.na(.)]

read_html(url) %>% 
  html_nodes(css = ".star-rating") %>% 
  html_attr(name = "class") %>% 
  str_remove(pattern = "star-rating ")

# turn our code into a function
get_books <- function(page) {
  
  base_url <- "http://books.toscrape.com/catalogue/page-"
  url <- str_c(base_url, page, ".html")
  
  books_html <- read_html(url)
  
  prices <- books_html %>% 
    html_nodes(css = ".price_color") %>% 
    html_text()
  
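  # keep only the links that carry a title attribute
  titles <- books_html %>% 
    html_nodes(css = ".product_pod a") %>% 
    html_attr("title") %>% 
    .[!is.na(.)]
  
  # the rating is stored in the class attribute, e.g. "star-rating Three"
  ratings <- books_html %>% 
    html_nodes(css = ".star-rating") %>% 
    html_attr(name = "class") %>% 
    str_remove(pattern = "star-rating ")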
  
  books_df <- tibble(
    title  = titles,
    price  = prices,
    rating = ratings
  )
  
  return(books_df)
}

# iterate across pages using our function
pages <- 1:5
books <- map_df(pages, get_books)

books
```

---

## Web scraping workflow

1. Understand the website's hierarchy and what information you need.

2. Read and save the HTML document from the URL.
    
    ```r
    html_obj <- read_html("https://www.website-to-scrape.com")
    ```
--
3. Use SelectorGadget to identify relevant CSS selectors.

4. Subset the resulting HTML document using CSS selectors.
    
    ```r
    html_obj %>% 
      html_nodes(css = "specified_css_selector")
    ```
--
5. Further extract attributes, text, or tags by adding another layer with
    
    ```r
    html_obj %>% 
      html_nodes(css = "specified_css_selector") %>% 
      html_*()
    ```
   where `*` is `text`, `attr`, `attrs`, `name`, or `table`.
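
The steps above can be sketched end to end as follows. To keep the example runnable offline, a made-up HTML string stands in for the URL, and the selectors `.price` and `.book a` are assumed to be what SelectorGadget would suggest for this hypothetical page.

```r
library(rvest)

# Step 2: read the document (a made-up string stands in for a URL here)
html_obj <- read_html('
<body>
  <div class="book">
    <h3><a href="a.html" title="Book A">Book A</a></h3>
    <p class="price">10.00</p>
  </div>
  <div class="book">
    <h3><a href="b.html" title="Book B">Book B</a></h3>
    <p class="price">12.50</p>
  </div>
</body>')

# Steps 3-4: subset with a CSS selector (assumed to come from SelectorGadget)
price_nodes <- html_obj %>% 
  html_nodes(css = ".price")

# Step 5: extract text or attributes from the selected nodes
prices <- html_text(price_nodes)
titles <- html_obj %>% 
  html_nodes(css = ".book a") %>% 
  html_attr(name = "title")

# Combine the pieces into one table
books_toy <- data.frame(title = titles, price = prices)
```

The document is read once and reused for every selector, which avoids requesting the same page repeatedly when the source is a real URL.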

---

## References

1. Easily Harvest (Scrape) Web Pages. (2020). Rvest.tidyverse.org. Retrieved 
   from https://rvest.tidyverse.org/

2. W3Schools Online Web Tutorials. (2020). W3schools.com. Retrieved 
   from https://www.w3schools.com/

3. SelectorGadget: point and click CSS selectors. (2020). Selectorgadget.com. 
   Retrieved from https://selectorgadget.com/

---

## Your turn!

[https://classroom.github.com/a/7G5i51px](https://classroom.github.com/a/7G5i51px)