Web Scraping with rvest

Hypertext Markup Language

Most data on the web is still published as HTML. HTML is structured (hierarchical / tree based), but that structure is rarely the flat, tidy form that analysis needs.

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

rvest

rvest is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.

Core functions:

read_html - read HTML data from a url or character string.
html_nodes - select specified nodes from the HTML document using CSS selectors.
html_table - parse an HTML table into a data frame.
html_text - extract tag pairs’ content.
html_name - extract tags’ names.
html_attrs - extract all of each tag’s attributes.
html_attr - extract tags’ attribute value by name.
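
As a minimal sketch of how these functions fit together, the snippet below parses the small example page from a character string (no network needed). The table inside the page is an assumption added only so that html_table has something to parse; it is not part of the original example.

```r
library(rvest)

# The example page from earlier, plus an assumed <table> for html_table()
page <- read_html('<html>
  <head><title>This is a title</title></head>
  <body>
    <p align="center">Hello world!</p>
    <table>
      <tr><th>x</th><th>y</th></tr>
      <tr><td>1</td><td>a</td></tr>
      <tr><td>2</td><td>b</td></tr>
    </table>
  </body>
</html>')

# Select every <p> node with a CSS selector
p <- html_nodes(page, "p")

p_text  <- html_text(p)           # content between <p> and </p>
p_name  <- html_name(p)           # the tag name
p_align <- html_attr(p, "align")  # one attribute, looked up by name
p_attrs <- html_attrs(p)          # all attributes, one named vector per node

# html_table() on a node set returns a list with one data frame per table
tbl <- html_table(html_nodes(page, "table"))[[1]]
```

Recent rvest releases also offer html_elements as the newer name for html_nodes; the older name is used here to match the list above.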

CSS selectors

We will use a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector, which can then be used to subset the HTML document.

Selector	Example	Description
element	`p`	Select all <p> elements
element element	`div p`	Select all <p> elements inside a <div> element
element>element	`div > p`	Select all <p> elements with <div> as a parent
.class	`.title`	Select all elements with class=“title”
#id	`#name`	Select all elements with id=“name”
[attribute]	`[class]`	Select all elements with a class attribute
[attribute=value]	`[class=title]`	Select all elements with class=“title”
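
The selectors in the table can be tried out on a toy document. This is only a sketch: the markup, the id, and the class name below are made up to give each selector something to match.

```r
library(rvest)

# Toy document; structure, id, and class are invented for illustration
doc <- read_html('<html><body>
  <div id="name">
    <p class="title">Direct child</p>
    <blockquote><p>Nested deeper</p></blockquote>
  </div>
  <p>Outside the div</p>
</body></html>')

# Small helper: text of every node matched by a selector
sel_text <- function(css) html_text(html_nodes(doc, css))

all_p      <- sel_text("p")              # every <p>
inside_div <- sel_text("div p")          # <p> anywhere inside a <div>
child_div  <- sel_text("div > p")        # <p> whose parent is a <div>
by_class   <- sel_text(".title")         # class="title"
by_id      <- html_name(html_nodes(doc, "#name"))  # tag name of the id="name" element
has_class  <- sel_text("[class]")        # any element with a class attribute
class_eq   <- sel_text("[class=title]")  # class attribute equal to "title"
```

Note the difference between `div p` and `div > p`: the first also matches the paragraph nested inside the blockquote, the second does not.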

Live Demo

… because rmarkdown hates rvest for whatever reason …

App Exercise - Part 2

For the first five movies in the Now Playing (Box Office) list on imdb.com, create a data frame with the movies’ titles, their weekend gross, and the url of their poster.

This will involve multiple steps:

Using the main imdb page, find the title, gross, and movie-specific url within imdb.
Examine each movie subpage to find the poster url, the MPAA rating, the movie run time, the user rating, and the metascore. Hint - the same approach should work for all five pages, since imdb’s movie pages all share the same structure.
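
The shape of this two-step approach can be sketched on stand-in HTML. Everything page-specific here is an assumption: the markup, the class names (`.movie`, `.gross`, `.poster`), the urls, and the helper `get_poster` are hypothetical and are not imdb’s actual structure. The real selectors have to be found with SelectorGadget.

```r
library(rvest)

# Stand-in for the listing page; markup and class names are made up
listing <- read_html('<ul>
  <li class="movie"><a href="/title/a/">Movie A</a><span class="gross">$10M</span></li>
  <li class="movie"><a href="/title/b/">Movie B</a><span class="gross">$5M</span></li>
</ul>')

# Stand-ins for the movie subpages, keyed by their (hypothetical) relative url
subpages <- list(
  "/title/a/" = '<div class="poster"><img src="a.jpg"></div>',
  "/title/b/" = '<div class="poster"><img src="b.jpg"></div>'
)

# Step 1: title, gross, and movie-specific url from the listing page
links <- html_nodes(listing, ".movie a")
movies <- data.frame(
  title = html_text(links),
  gross = html_text(html_nodes(listing, ".movie .gross")),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Step 2: one extraction function, reused for every subpage
# (with live pages, read_html() would receive the full url instead of a string)
get_poster <- function(html) {
  html_attr(html_nodes(read_html(html), ".poster img"), "src")
}
movies$poster <- vapply(
  movies$url,
  function(u) get_poster(subpages[[u]]),
  character(1),
  USE.NAMES = FALSE
)

# Keep the first five rows, as the exercise asks
movies <- head(movies, 5)
```

Because every subpage shares one structure, a single function applied over the urls is enough; the other fields from the subpages could be added to the same function in the same way.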