class: center, middle, inverse, title-slide

# Web Scraping
### Colin Rundel
### 2019-02-26

---
exclude: true

```r
library(magrittr)
library(rvest)
```

---
class: middle
count: false

# Web Scraping with rvest

---

## Hypertext Markup Language

Most of the data on the web is still largely available as HTML - while it is structured (hierarchical / tree based), it is often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
```

---

## rvest

`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.

<br/>

Core functions:

* `read_html` - read HTML data from a url or character string.

* `html_nodes` - select specified nodes from the HTML document using CSS selectors.

* `html_table` - parse an HTML table into a data frame.

* `html_text` - extract tag pairs' content.

* `html_name` - extract tags' names.

* `html_attrs` - extract all of each tag's attributes.

* `html_attr` - extract tags' attribute value by name.

---

## html, rvest, & xml2

```r
html = '<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>'

read_html(html)
```

```
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <p align="center">Hello world!</p>\n <br><div class="name" ...
```

---

## css selectors

We will be using a tool called SelectorGadget to help us identify the HTML elements of interest - it does this by constructing a CSS selector which can be used to subset the HTML document.

.small[

Selector          | Example          | Description
:-----------------|:-----------------|:--------------------------------------------------
element           | `p`              | Select all `<p>` elements
element element   | `div p`          | Select all `<p>` elements inside a `<div>` element
element>element   | `div > p`        | Select all `<p>` elements with `<div>` as a parent
.class            | `.title`         | Select all elements with class="title"
#id               | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"

]

---

## Selecting tags

```r
read_html(html) %>% html_nodes("p")
```

```
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
```

--

```r
read_html(html) %>% html_nodes("p") %>% html_text()
```

```
## [1] "Hello world!"
```

--

```r
read_html(html) %>% html_nodes("p") %>% html_attrs()
```

```
## [[1]]
##    align
## "center"
```

--

```r
read_html(html) %>% html_nodes("p") %>% html_attr("align")
```

```
## [1] "center"
```

---

## More selecting tags

```r
read_html(html) %>% html_nodes("div")
```

```
## {xml_nodeset (7)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
## [4] <div class="home">555-555-1234</div>
## [5] <div class="home">555-555-2345</div>
## [6] <div class="work">555-555-9999</div>
## [7] <div class="fax">555-555-8888</div>
```

--

```r
read_html(html) %>% html_nodes("div") %>% html_text()
```

```
## [1] "John"
## [2] "Doe"
## [3] "\n 555-555-1234\n 555-555-2345\n 555-555-9999\n 555-555-8888\n "
## [4] "555-555-1234"
## [5] "555-555-2345"
## [6] "555-555-9999"
## [7] "555-555-8888"
```

---

## Nesting tags

```r
read_html(html) %>% html_nodes("body div")
```

```
## {xml_nodeset (7)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
## [4] <div class="home">555-555-1234</div>
## [5] <div class="home">555-555-2345</div>
## [6] <div class="work">555-555-9999</div>
## [7] <div class="fax">555-555-8888</div>
```

--

```r
read_html(html) %>% html_nodes("body>div")
```

```
## {xml_nodeset (3)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
```

--

```r
read_html(html) %>% html_nodes("body div div")
```

```
## {xml_nodeset (4)}
## [1] <div class="home">555-555-1234</div>
## [2] <div class="home">555-555-2345</div>
## [3] <div class="work">555-555-9999</div>
## [4] <div class="fax">555-555-8888</div>
```

---

## CSS, classes, and ids

```r
read_html(html) %>% html_nodes(".name")
```

```
## {xml_nodeset (2)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
```

--

```r
read_html(html) %>% html_nodes("div.name")
```

```
## {xml_nodeset (2)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
```

--

```r
read_html(html) %>% html_nodes("#first")
```

```
## {xml_nodeset (1)}
## [1] <div class="name" id="first">John</div>
```

---

## Mixing it up

```r
read_html(html) %>% html_nodes("[align]")
```

```
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
```

```r
read_html(html) %>% html_nodes(".contact div")
```

```
## {xml_nodeset (4)}
## [1] <div class="home">555-555-1234</div>
## [2] <div class="home">555-555-2345</div>
## [3] <div class="work">555-555-9999</div>
## [4] <div class="fax">555-555-8888</div>
```

---

## html tables

```r
html_table = '<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <table>
      <tr> <th>a</th> <th>b</th> <th>c</th> </tr>
      <tr> <td>1</td> <td>2</td> <td>3</td> </tr>
      <tr> <td>2</td> <td>3</td> <td>4</td> </tr>
      <tr> <td>3</td> <td>4</td> <td>5</td> </tr>
    </table>
  </body>
</html>'
```

--

```r
read_html(html_table) %>% html_nodes("table") %>% html_table()
```

```
## [[1]]
##   a b c
## 1 1 2 3
## 2 2 3 4
## 3 3 4 5
```

---

## SelectorGadget

This is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.

<center>
<img src='imgs/selectorgadget.png' width=500>
<br/>
<a href='http://selectorgadget.com/'>http://selectorgadget.com/</a>
</center>

---

## Exercise

### Step 1

For the movies listed in the **Top Box Office** list on `rottentomatoes.com`, create a data frame with the movies' titles, their weekend gross, their tomatometer score, and whether each movie is fresh or rotten.

<br/>

### Step 2

Using the URL for each movie, retrieve the average rating, the number of reviews, the numbers of fresh and rotten reviews, as well as the audience score, the average audience rating, and the number of user ratings.
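
---

## Exercise - a possible starting point

The pattern Step 1 asks for (select nodes, extract text or attributes, assemble a data frame) can be sketched on a small inline document. This is only an illustration: the markup, the `#box-office`, `.title`, `.gross` and `.score` selectors, the movie names, and the 60% cutoff for "fresh" are all assumptions made for this sketch and are not taken from the real `rottentomatoes.com` page, whose selectors need to be found with SelectorGadget.

```r
library(magrittr)
library(rvest)

# Hypothetical markup - NOT the real rottentomatoes.com page structure
page = '<html>
  <body>
    <table id="box-office">
      <tr>
        <td class="score">91%</td>
        <td class="title"><a href="/m/movie_a">Movie A</a></td>
        <td class="gross">$55.0M</td>
      </tr>
      <tr>
        <td class="score">42%</td>
        <td class="title"><a href="/m/movie_b">Movie B</a></td>
        <td class="gross">$12.3M</td>
      </tr>
    </table>
  </body>
</html>'

doc = read_html(page)

# One column per selector; each selector matches one node per movie
movies = data.frame(
  title = doc %>% html_nodes("#box-office .title a") %>% html_text(),
  gross = doc %>% html_nodes("#box-office .gross") %>% html_text(),
  score = doc %>% html_nodes("#box-office .score") %>% html_text(),
  url   = doc %>% html_nodes("#box-office .title a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)

# Convert the scraped strings into numbers
movies$score = as.numeric(sub("%", "", movies$score))
movies$gross = as.numeric(gsub("[$M]", "", movies$gross))

# Assumed rule for this sketch only: a score of 60 or more counts as fresh
movies$fresh = movies$score >= 60

movies
```

For Step 2, each value in `movies$url` would be turned into a full address and passed to `read_html()` in turn, with the same select-and-extract steps applied to every movie page.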