class: center, middle, inverse, title-slide

# Web Scraping
### Yue Jiang

---
class: inverse, center, middle

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the
  structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - an element is typically delimited by tags (a start tag and an end tag)
    - attributes provide additional information about HTML elements

<center>
<img src="images/html-structure.png" height="300" width="450">
</center>

---

## Simple HTML document

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```

<br/><br/>

We can visualize this in a tree-like structure.

---

## HTML tree-like structure

<center>
<img src="images/html-tree.png" height="450" width="550">
</center>

If we have access to an HTML document, then how can we easily extract
information?

---
class: inverse, center, middle

# Package `rvest`

---

## Package `rvest`

`rvest` is a package authored by Hadley Wickham that makes basic processing and
manipulation of HTML documents easy.

```r
library(tidyverse)
library(rvest)
```

Core functions:

.small-text[

| Function            | Description                                                        |
|---------------------|--------------------------------------------------------------------|
| `xml2::read_html()` | read HTML from a character string or URL                           |
| `html_nodes()`      | select specified pieces from the HTML document using CSS selectors |
| `html_table()`      | parse an HTML table into a data frame                              |
| `html_text()`       | extract text content                                               |
| `html_name()`       | extract tag names                                                  |
| `html_attrs()`      | extract all attributes and values                                  |
| `html_attr()`       | extract the value of a specified attribute                         |

]

---

## HTML in R

We'll create a simple HTML document as a string to demonstrate some of these
functions.
```r
simple_html <- "<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>"
```

--

Preview our character object:

```r
simple_html
```

```
#> [1] "<html>\n<head>\n<title>Web Scraping</title>\n</head>\n<body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>\n</html>"
```

---

## HTML in R

Read in the document with `read_html()`.

```r
html_simple <- read_html(simple_html)
```

--

<br/>

What does this look like?

--

```r
html_simple
```

```
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>
```

---

## Subset with `html_nodes()`

Let's extract the highlighted component below.

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
*<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```

--

```r
h1_nodes <- html_nodes(html_simple, css = "h1")
h1_nodes
```

```
#> {xml_nodeset (1)}
#> [1] <h1>Using rvest</h1>
```

---

## Extract contents and tag name

Let's extract "Using rvest" and `h1`.

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
*<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```

--

```r
h1_nodes %>%
  html_text()
```

```
#> [1] "Using rvest"
```

```r
h1_nodes %>%
  html_name()
```

```
#> [1] "h1"
```

---

## Scaling up

Most HTML documents are not as simple as what we just examined. There may be
tables, hundreds of links, paragraphs of text, and more. Naturally, we may
wonder:

<br/>

1. How do we handle larger HTML documents? (see next slide)

2. How do we know what to provide to `css` in function `html_nodes()` when
   we attempt to subset the HTML document?

3. Are these functions in `rvest` vectorized? For instance, are we able to get
   all the content in the `td` tags on the slide that follows?

<br/>

In Chrome, you can view the HTML document associated with a web page by going
to `View > Developer > View Source`.
---

.tiny[

```html
<html lang=en>
<head>
  <title>Rays Notebook: Open Water Swims 2020 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 383 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
  <td class=date>Jan 12, Sun</td>
  <td class=where>
    <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
    <span class=more>
    Gandy Beach, Gandy Blvd N St, Petersburg, FL
    </span>
  </td>
  <td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
  <td class=distance>5 km</td>
  <td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```

]

This is a snippet from the HTML document associated with the website
[here](https://raysnotebook.info/ows/schedules/The%20Whole%20Shebang.html).

---
class: middle, center, inverse

# CSS and SelectorGadget

---

## CSS selectors

.tiny-text[
To extract components out of HTML documents, use `html_nodes()` and CSS
selectors. In CSS, selectors are patterns used to select the elements you want
to style.

We can determine the CSS selectors we need with the point-and-click tool
[SelectorGadget](https://selectorgadget.com/). More on this in a moment.
]
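As a rough sketch of how selectors behave on a larger document, the code below parses a trimmed, hand-copied version of the single table row from the schedule snippet shown earlier. The object names (`swim_html`, `swim_doc`, `td_text`, `distance`, `name_link`) are made up for illustration, and the attribute values are quoted here only for readability. It suggests that `html_nodes()` returns every match and that `html_text()` and `html_attr()` work on the whole node set at once.

```r
library(rvest)

# Hypothetical, trimmed copy of one row from the schedule snippet
swim_html <- '<html><body>
<table class="schedule">
<tbody>
<tr id="January">
<td class="date">Jan 12, Sun</td>
<td class="where"><a class="mapq" href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a></td>
<td class="name"><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class="distance">5 km</td>
</tr>
</tbody>
</table>
</body></html>'

swim_doc <- read_html(swim_html)

# Element selector: all <td> nodes come back together in one node set,
# and html_text() returns one string per node
td_text <- swim_doc %>%
  html_nodes(css = "td") %>%
  html_text()

# Class selector: only the cell with class="distance"
distance <- swim_doc %>%
  html_nodes(css = ".distance") %>%
  html_text()

# Descendant selector plus html_attr(): the link target inside the name cell
name_link <- swim_doc %>%
  html_nodes(css = ".name a") %>%
  html_attr(name = "href")
```

Printing `td_text` shows one character string per `td` cell, which is what makes it practical to build data frame columns from a single selector.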
---

## CSS selectors

.tiny-text[

| Selector          | Example        | `html_nodes()` `css` value            | Description: selects all              |
|-------------------|----------------|---------------------------------------|---------------------------------------|
| element           | `p`            | `html_nodes(x, css = "p")`            | `<p>` elements                        |
| element element   | `div p`        | `html_nodes(x, css = "div p")`        | `<p>` elements inside a `<div>` element |
| .class            | `.title`       | `html_nodes(x, css = ".title")`       | elements with class="title"           |
| #id               | `#name`        | `html_nodes(x, css = "#name")`        | elements with id="name"               |
| [attribute]       | `[class]`      | `html_nodes(x, css = "[class]")`      | elements with a class attribute       |
| [attribute=value] | `[href='www']` | `html_nodes(x, css = "[href='www']")` | elements with href="www"              |

]

For more CSS selector references click
[here](https://www.w3schools.com/cssref/css_selectors.asp).

???

- CSS stands for Cascading Style Sheets.

- CSS describes how HTML elements are to be displayed on screen, paper, or in
  other media.

- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a `<style>` element in the `<head>` section
    - External - by using an external CSS file

---

## SelectorGadget

[SelectorGadget](https://selectorgadget.com/) makes it easy to identify the CSS
selector you need by clicking on items on a webpage.

<center>
<iframe title="vimeo-player" src="https://player.vimeo.com/video/52055686" width="800" height="400" frameborder="0" allowfullscreen></iframe>
</center>

---
class: inverse, center, middle

# Live demo

---

Let's go to http://books.toscrape.com/catalogue/page-1.html and scrape the
first five pages of data on books with regard to their

1. title
2. price
3. star rating

We'll organize our results in a neatly formatted tibble similar to the one
below.
```r
# A tibble: 100 x 3
   title                                             price rating
   <chr>                                             <chr> <chr>
 1 A Light in the Attic                              £51.… Three
 2 Tipping the Velvet                                £53.… One
 3 Soumission                                        £50.… One
 4 Sharp Objects                                     £47.… Four
 5 Sapiens: A Brief History of Humankind             £54.… Five
 6 The Requiem Red                                   £22.… One
 7 The Dirty Little Secrets of Getting Your Dream J… £33.… Four
 8 The Coming Woman: A Novel Based on the Life of t… £17.… Three
 9 The Boys in the Boat: Nine Americans and Their E… £22.… Four
10 The Black Maria                                   £52.… One
# … with 90 more rows
```

**Code is given in the presentation notes. Hit `P`.**

???

## Solution

```r
# example for page 1, see how everything works
url <- "http://books.toscrape.com/catalogue/page-1.html"

# prices: text of every element with class "price_color"
read_html(url) %>%
  html_nodes(css = ".price_color") %>%
  html_text()

# titles: title attribute of the links; links without one give NA and are dropped
read_html(url) %>%
  html_nodes(css = ".product_pod a") %>%
  html_attr("title") %>%
  .[!is.na(.)]

# ratings: the class attribute is e.g. "star-rating Three"; strip the prefix
read_html(url) %>%
  html_nodes(css = ".star-rating") %>%
  html_attr(name = "class") %>%
  str_remove(pattern = "star-rating ")

# turn our code into a function
get_books <- function(page) {

  base_url <- "http://books.toscrape.com/catalogue/page-"
  url <- str_c(base_url, page, ".html")

  # read the page once and reuse the parsed document
  books_html <- read_html(url)

  prices <- books_html %>%
    html_nodes(css = ".price_color") %>%
    html_text()

  titles <- books_html %>%
    html_nodes(css = ".product_pod a") %>%
    html_attr("title") %>%
    .[!is.na(.)]

  ratings <- books_html %>%
    html_nodes(css = ".star-rating") %>%
    html_attr(name = "class") %>%
    str_remove(pattern = "star-rating ")

  books_df <- tibble(
    title  = titles,
    price  = prices,
    rating = ratings
  )

  return(books_df)
}

# iterate across pages using our function
pages <- 1:5
books <- map_df(pages, get_books)
books
```

---

## Web scraping workflow

1. Understand the website's hierarchy and what information you need.

--

2. Read and save the HTML document from the URL.

```r
html_obj <- read_html("https://www.website-to-scrape.com")
```

--

3. Use SelectorGadget to identify relevant CSS selectors.

--

4. Subset the resulting HTML document using CSS selectors.
```r
html_obj %>%
  html_nodes(css = "specified_css_selector")
```

--

5. Further extract attributes, text, or tags by adding another layer with

```r
html_obj %>%
  html_nodes(css = "specified_css_selector") %>%
  html_*()
```

where `*` is `text`, `attr`, `attrs`, `name`, or `table`.

---

## References

1. Easily Harvest (Scrape) Web Pages. (2020). Rvest.tidyverse.org.
   Retrieved from https://rvest.tidyverse.org/

2. W3Schools Online Web Tutorials. (2020). W3schools.com.
   Retrieved from https://www.w3schools.com/

3. SelectorGadget: point and click CSS selectors. (2020). Selectorgadget.com.
   Retrieved from https://selectorgadget.com/

---

## Your turn!

[https://classroom.github.com/a/7G5i51px](https://classroom.github.com/a/7G5i51px)