class: center, middle, inverse, title-slide

# Web Scraping
## Statistical Programming
### Adapted from STA 523, Professor Shawn Santo Presented by Morris Greenberg
### 11-05-19

---
class: inverse, center, middle

# Web Scraping

---

# Why Web Scraping?

- Webpages contain lots of information.
    - We can see statistics of the best players on ESPN leaderboards (tables of data)
    - We can see customer reviews for different products on Amazon (text data)
    - We can see the closest Starbucks locations on its store locator (geolocation data)

- **Web Scraping** allows us to systematically extract and save the data from web pages.

<center>
<img src="images/Starbucks.png" height="300" width="450">
</center>

---
class: inverse, center, middle

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - elements are generally wrapped in tags (start and end tag)
    - attributes provide additional information about HTML elements

<center>
<img src="images/html-structure.png" height="300" width="450">
</center>

---

## Simple HTML document

```html
<!DOCTYPE html>
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

<br/><br/>

We can visualize this in a tree-like structure...

---

## HTML as a tree

<center>
<img src="images/html-tree.png" height="400" width="500">
</center>

If we have access to an HTML document, then how can we easily extract information?

---
class: inverse, center, middle

# `rvest`

---

## Package `rvest`

`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data easy.
```r
library(rvest)
```

Core functions:

- `read_html()` - read HTML data from a url or character string
- `html_nodes()` - select specified nodes from the HTML document using CSS selectors
- `html_table()` - parse an HTML table into a data frame
- `html_text()` - extract tag pairs' content
- `html_name()` - extract tags' names
- `html_attrs()` - extract all of each tag's attributes
- `html_attr()` - extract tags' attribute value by name

---

## `html_document`

```r
simple_html <- "<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>

    <h1>Using rvest</h1>
    <p>To get started...</p>

  </body>
</html>"

html_doc <- read_html(simple_html)
attributes(html_doc)
```

```
#> $names
#> [1] "node" "doc"
#>
#> $class
#> [1] "xml_document" "xml_node"
```

---

```r
html_doc
```

```
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
#> [2] <body>\n \n <h1>Using rvest</h1>\n <p>To get started...</p>\n ...
```

---

## CSS selectors

To extract components out of HTML documents use `html_nodes()` and CSS selectors. In CSS, selectors are patterns used to select the elements you want to style.

We can determine the CSS selectors we need with the point-and-click tool [SelectorGadget](https://selectorgadget.com/). More on this in a moment.

.small-text[

Selector          | Example          | Description
:-----------------|:-----------------|:--------------------------------------------------
element           | `p`              | Select all `<p>` elements
element element   | `div p`          | Select all `<p>` elements inside a `<div>` element
element>element   | `div > p`        | Select all `<p>` elements with `<div>` as a parent
.class            | `.title`         | Select all elements with class="title"
#id               | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"

]

For more CSS selector references click [here](https://www.w3schools.com/cssref/css_selectors.asp).

???
- CSS stands for Cascading Style Sheets.

- CSS describes how HTML elements are to be displayed on screen, paper, or in other media.

- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a `<style>` element in the `<head>` section
    - External - by using an external CSS file

---

## Examples

.tiny[
```r
html_swim <- '<html lang=en>
<head>
  <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database. 396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
<td class=where>
  <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
  <span class=more>
  Gandy Beach, Gandy Blvd N St, Petersburg, FL
  </span>
</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>'
```
]

---

.tiny[
```r
<html lang=en>
<head>
  <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

*<p>This schedule lists every swim in the database.
396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
<td class=where>
  <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
  <span class=more>
  Gandy Beach, Gandy Blvd N St, Petersburg, FL
  </span>
</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```
]

---

To extract all `<p>` elements

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "p")
```

```
#> {xml_nodeset (1)}
#> [1] <p>This schedule lists every swim in the database. 396 events.</p>
```

To extract the contents between the tags

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "p") %>%
  html_text()
```

```
#> [1] "This schedule lists every swim in the database. 396 events."
```

---

.tiny[
```r
<html lang=en>
<head>
  <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database.
396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
*<td class=where>
* <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
* <span class=more>
* Gandy Beach, Gandy Blvd N St, Petersburg, FL
* </span>
*</td>
<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```
]

---

To select all elements with `class="where"`

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "[class=where]")
```

```
#> {xml_nodeset (1)}
#> [1] <td class="where">\n <a class="mapq" href="http://www.google.com/m ...
```

--

To extract the text

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "[class=where]") %>%
  html_text()
```

```
#> [1] "\n Petersburg, FL\n \n Gandy Beach, Gandy Blvd N St, Petersburg, FL\n \n"
```

--

To extract the attributes

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "[class=where]") %>%
  html_attrs()
```

```
#> [[1]]
#>   class
#> "where"
```

---

.tiny[
```r
<html lang=en>
<head>
  <title>Rays Notebook: Open Water Swims 2019 — The Whole Shebang</title>
</head>
<body>
<main class=schedule>
<h1>The Whole Shebang</h1>

<p>This schedule lists every swim in the database.
396 events.</p>

<table class=schedule>
<thead><tr><th>Date</th><th>Location</th><th>Name</th><th>Distance</th><th>More</th></tr></thead>
<tbody>

<tr id=January>
<td class=date>Jan 13, Sun</td>
<td class=where>
* <a class=mapq href="http://www.google.com/maps/?q=27.865501,-82.631997">Petersburg, FL</a>
  <span class=more>
  Gandy Beach, Gandy Blvd N St, Petersburg, FL
  </span>
</td>
*<td class=name><a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a></td>
<td class=distance>5 km</td>
<td class=more><span class=time>7:15 AM</span>, Old Tampa Bay.</td>
</tr>
</body>
</html>
```
]

---

To extract the links (those with an `href` attribute)

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "[href]")
```

```
#> {xml_nodeset (2)}
#> [1] <a class="mapq" href="http://www.google.com/maps/?q=27.865501,-82.63 ...
#> [2] <a href="http://tampabayfrogman.com/">Tampa Bay Frogman</a>
```

--

To get only the URLs

```r
html_swim %>%
  read_html() %>%
  html_nodes(css = "[href]") %>%
  html_attr("href")
```

```
#> [1] "http://www.google.com/maps/?q=27.865501,-82.631997"
#> [2] "http://tampabayfrogman.com/"
```

---

## SelectorGadget

[SelectorGadget](https://selectorgadget.com/) makes it easy to identify the CSS selector you need by clicking on items on a webpage.

<center>
<iframe title="vimeo-player" src="https://player.vimeo.com/video/52055686" width="600" height="400" frameborder="0" allowfullscreen></iframe>
</center>

---
class: inverse, center, middle

# Live demo

---

## Exercise

Scrape the Virginia Wegmans store names along with each store's address and phone number (available at each store's link). Build a data frame that looks similar to what you see below.
.tiny[
```
#> # A tibble: 12 x 5
#>    store     state full_address               phone   website
#>    <chr>     <chr> <chr>                      <chr>   <chr>
#>  1 Alexandr~ VA    7905 Hilltop Village Cent~ 571-52~ https://www.wegmans.~
#>  2 Chantilly VA    14361 Newbrook Drive Chan~ 571-52~ https://www.wegmans.~
#>  3 Charlott~ VA    100 Wegmans Way Charlotte~ (434) ~ https://www.wegmans.~
#>  4 Dulles    VA    45131 Columbia Place Ster~ 703-42~ https://www.wegmans.~
#>  5 Fairfax   VA    11620 Monument Drive Fair~ 703-65~ https://www.wegmans.~
#>  6 Frederic~ VA    2281 Carl D. Silver Parkw~ 540-32~ https://www.wegmans.~
#>  7 Lake Man~ VA    8297 Stonewall Shops Squa~ 571-22~ https://www.wegmans.~
#>  8 Leesburg  VA    101 Crosstrail Blvd SE Le~ 703-66~ https://www.wegmans.~
#>  9 Midlothi~ VA    12501 Stone Village Way M~ 804-41~ https://www.wegmans.~
#> 10 Potomac   VA    14801 Dining Way Woodbrid~ 703-76~ https://www.wegmans.~
#> 11 Short Pu~ VA    12200 Wegmans Blvd Henric~ 804-37~ https://www.wegmans.~
#> 12 Virginia~ VA    4721 Virginia Beach Blvd ~ 757-27~ https://www.wegmans.~
```
]

<br/><br/>

If you have time, try to clean up the data frame using regular expressions and `stringr` functions from last class.

---

## References

- https://www.w3schools.com/
- https://selectorgadget.com/
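
---

## Sketch: from nodes to a data frame

The `rvest` functions covered earlier can be combined to turn selected nodes into a data frame, which is the general shape of the exercise. The snippet below is only a minimal sketch under stated assumptions: `html_demo`, its two rows, and the `example.com` URLs are made-up stand-ins modeled on the swim schedule, and every row is assumed to have exactly one name cell and one distance cell.

```r
library(rvest)

# Hypothetical two-row schedule modeled on the swim example; not real data.
html_demo <- '<table>
<tr>
<td class="name"><a href="http://example.com/a">Swim A</a></td>
<td class="distance">5 km</td>
</tr>
<tr>
<td class="name"><a href="http://example.com/b">Swim B</a></td>
<td class="distance">1 mile</td>
</tr>
</table>'

doc <- read_html(html_demo)

# Select the link inside each name cell once, then reuse the node set
# for both the visible text and the href attribute.
links <- html_nodes(doc, css = "td.name a")

swims <- data.frame(
  name     = html_text(links),
  url      = html_attr(links, "href"),
  distance = html_text(html_nodes(doc, css = "td.distance")),
  stringsAsFactors = FALSE
)

swims
```

Reusing one node set for `html_text()` and `html_attr()` keeps the name and URL vectors aligned row by row. On a real page, check that every selector returns the same number of nodes before binding the columns together.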