---
title: Web Scraping
author: "Colin Rundel"
date: "2018-10-03"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)

knitr::opts_chunk$set(
  fig.align = "center"
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

```{r setup, message=FALSE}
library(magrittr)
library(rvest)
```

---
class: middle
count: false

# Web Scraping with rvest

---

## Hypertext Markup Language

Most of the data on the web is still largely available as HTML. While it is structured (hierarchical / tree based), it is often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>

    <div class="name" id="first">John</div>
    <div class="name">Doe</div>
    <div class="contact">
      <div>555-555-1234</div>
      <div>555-555-2345</div>
      <div>555-555-9999</div>
      <div>555-555-8888</div>
    </div>
  </body>
</html>
```

---

## rvest

`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.

Core functions:

* `read_html` - read HTML data from a url or character string.

* `html_nodes` - select specified nodes from the HTML document using CSS selectors.

* `html_table` - parse an HTML table into a data frame.

* `html_text` - extract tag pairs' content.

* `html_name` - extract tags' names.

* `html_attrs` - extract all of each tag's attributes.

* `html_attr` - extract tags' attribute value by name.

---

## html, rvest, & xml2

```{r}
html = '<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>

    <div class="name" id="first">John</div>
    <div class="name">Doe</div>
    <div class="contact">
      <div>555-555-1234</div>
      <div>555-555-2345</div>
      <div>555-555-9999</div>
      <div>555-555-8888</div>
    </div>
  </body>
</html>'

read_html(html)
```

---

## css selectors

We will be using a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector which can be used to subset the HTML document.

.small[

Selector          | Example          | Description
:-----------------|:-----------------|:--------------------------------------------------
element           | `p`              | Select all &lt;p&gt; elements
element element   | `div p`          | Select all &lt;p&gt; elements inside a &lt;div&gt; element
element>element   | `div > p`        | Select all &lt;p&gt; elements with &lt;div&gt; as a parent
.class            | `.title`         | Select all elements with class="title"
#id               | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"

]

---

## Selecting tags

```{r}
read_html(html) %>% html_nodes("p")
```

--

```{r}
read_html(html) %>% html_nodes("p") %>% html_text()
```

--

```{r}
read_html(html) %>% html_nodes("p") %>% html_attrs()
```

--

```{r}
read_html(html) %>% html_nodes("p") %>% html_attr("align")
```

---

## More selecting tags

```{r}
read_html(html) %>% html_nodes("div")
```

--

```{r}
read_html(html) %>% html_nodes("div") %>% html_text()
```

---

## Nesting tags

```{r}
read_html(html) %>% html_nodes("body div")
```

--

```{r}
read_html(html) %>% html_nodes("body>div")
```

--

```{r}
read_html(html) %>% html_nodes("body div div")
```

---

## CSS, classes, and ids

```{r}
read_html(html) %>% html_nodes(".name")
```

--

```{r}
read_html(html) %>% html_nodes("div.name")
```

--

```{r}
read_html(html) %>% html_nodes("#first")
```

---

## Mixing it up

```{r}
read_html(html) %>% html_nodes("[align]")
```

```{r}
read_html(html) %>% html_nodes(".contact div")
```

---

## SelectorGadget

This is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.

http://selectorgadget.com/
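
Once SelectorGadget has suggested a selector, it can be passed straight to `html_nodes`. The sketch below is only an illustration: the page fragment, the class names, and the selector `.movie .title` are all hypothetical stand-ins for whatever a real page and SelectorGadget would give you.

```r
library(rvest)

# Hypothetical page fragment; real pages will have different markup.
page = read_html('
<ul id="movies">
  <li class="movie"><span class="title">Movie A</span> <span class="score">91%</span></li>
  <li class="movie"><span class="title">Movie B</span> <span class="score">47%</span></li>
</ul>
<p class="title">Not a movie</p>
')

# Assume SelectorGadget proposed ".movie .title" after clicking on the titles.
# The descendant selector keeps the two titles and skips the unrelated <p>.
titles = page %>%
  html_nodes(".movie .title") %>%
  html_text()

titles
```

If the result contains too many or too few elements, the usual fix is to go back to SelectorGadget and refine the selector rather than filter in R.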
---

## Exercise

### Step 1

For the movies listed in the **Top Box Office** list on `rottentomatoes.com`, create a data frame with the movies' titles, their weekend gross, their tomatometer score, and whether each movie is fresh or rotten.

### Step 2

Using the url for each movie, collect the average rating, the number of reviews, and the number of fresh and rotten reviews, as well as the audience score, the average audience rating, and the number of user ratings.
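
One possible shape for the result is sketched below. It is not a solution for the live site: the markup, class names, and urls are invented for illustration, and the real page will need selectors found with SelectorGadget.

```r
library(rvest)

# Hypothetical markup standing in for a box office listing.
page = read_html('
<div id="box-office">
  <div class="movie">
    <span class="score fresh">91%</span>
    <span class="title"><a href="/m/movie_a">Movie A</a></span>
    <span class="gross">$35.2M</span>
  </div>
  <div class="movie">
    <span class="score rotten">47%</span>
    <span class="title"><a href="/m/movie_b">Movie B</a></span>
    <span class="gross">$12.8M</span>
  </div>
</div>
')

# Select each piece once, then extract text or attributes from it.
links  = html_nodes(page, ".movie .title a")
scores = html_nodes(page, ".movie .score")

movies = data.frame(
  title = html_text(links),
  gross = html_text(html_nodes(page, ".movie .gross")),
  score = as.numeric(sub("%", "", html_text(scores))),
  # Assumes freshness is encoded in the class attribute of the score element.
  fresh = grepl("fresh", html_attr(scores, "class")),
  # Relative urls, kept for following up on each movie in Step 2.
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

movies
```

For Step 2, each value in the `url` column could be turned into a full address and passed to `read_html` in turn, repeating the same select-then-extract pattern on the movie page.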