Web Scraping data, Pt. 1


Web Scraping with rvest

Hypertext Markup Language

Most data on the web is still published as HTML. HTML is structured (hierarchical / tree based), but that structure is usually not the form needed for analysis (flat / tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

rvest

rvest is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.


Core functions:

  • read_html - read HTML data from a URL or character string.

  • html_nodes - select specified nodes from the HTML document using CSS selectors.

  • html_table - parse an HTML table into a data frame.

  • html_text - extract tag pairs’ content.

  • html_name - extract tags’ names.

  • html_attrs - extract all of each tag’s attributes.

  • html_attr - extract tags’ attribute value by name.
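
As a minimal sketch of how these functions fit together, the block below parses a small inline document with each of them. The HTML is the example from the earlier slide with a small table added purely for illustration (the table is an assumption, not part of the original example). In rvest 1.0 and later, html_nodes is also available under the newer name html_elements.

```r
library(rvest)

# Inline HTML: the earlier example plus a hypothetical table for html_table
doc <- read_html('
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <table>
      <tr><th>x</th><th>y</th></tr>
      <tr><td>1</td><td>a</td></tr>
      <tr><td>2</td><td>b</td></tr>
    </table>
  </body>
</html>')

# Select every <p> node with a CSS selector
p <- html_nodes(doc, "p")

html_text(p)            # text between the tag pair: "Hello world!"
html_name(p)            # tag name: "p"
html_attrs(p)           # all attributes, one named character vector per node
html_attr(p, "align")   # a single attribute by name: "center"

# html_table returns one data frame per selected <table>; take the first
tab <- html_table(html_nodes(doc, "table"))[[1]]
tab
```

Reading from a character string keeps the example independent of any live site; passing a URL to read_html works the same way.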

CSS selectors

We will be using a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector, which can be used to subset the HTML document.


Selector            Example          Description
element             p                Select all <p> elements
element element     div p            Select all <p> elements inside a <div> element
element>element     div > p          Select all <p> elements with <div> as a parent
.class              .title           Select all elements with class="title"
#id                 #name            Select all elements with id="name"
[attribute]         [class]          Select all elements with a class attribute
[attribute=value]   [class=title]    Select all elements with class="title"
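
To see what each selector in the table picks out, here is a small sketch run against a made-up HTML fragment. The fragment, the one-letter texts, and the helper name sel are all hypothetical and exist only for this illustration.

```r
library(rvest)

# Hypothetical fragment: two <p> inside a <div> (one nested deeper), one outside
doc <- read_html('
<div>
  <p class="title">A</p>
  <blockquote><p id="name">B</p></blockquote>
</div>
<p>C</p>')

# Helper (hypothetical): text of every node matched by a CSS selector
sel <- function(css) html_text(html_nodes(doc, css))

sel("p")              # every <p>: "A" "B" "C"
sel("div p")          # <p> anywhere inside a <div>: "A" "B"
sel("div > p")        # <p> whose direct parent is a <div>: "A"
sel(".title")         # class="title": "A"
sel("#name")          # id="name": "B"
sel("[class]")        # any element carrying a class attribute: "A"
sel("[class=title]")  # class attribute exactly "title": "A"
```

The difference between "div p" and "div > p" is visible in the second paragraph: it sits inside the <div> but its direct parent is the <blockquote>, so only the descendant selector matches it.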

Live Demo





… because rmarkdown hates rvest for whatever reason …

App Exercise - Part 2

For the first five movies in the Now Playing (Box Office) list on imdb.com, create a data frame with the movies' titles, their weekend gross, and the URL of their poster.

This will involve multiple steps:

  • Using the main IMDb page, find each movie's title, weekend gross, and movie-specific URL within IMDb.

  • Examine each movie's subpage to find the poster URL, the MPAA rating, the run time, the user rating, and the metascore. Hint - the same approach should work for all five pages, since IMDb's movie pages all share the same structure.


The only hard-coded URL you should be using is imdb.com.
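
The two-step pattern (listing page first, then each subpage) might look roughly like the sketch below. It is not a solution: the HTML strings, class names (movie, gross, poster), paths, and the helper get_poster are all hypothetical stand-ins, and the real IMDb markup will need its own selectors found with SelectorGadget. The one piece that carries over directly is turning the relative links from the listing page into absolute URLs from a single base URL.

```r
library(rvest)

base_url <- "https://www.imdb.com"   # the only hard-coded URL

# Step 1: hypothetical stand-in for the listing page (real markup will differ)
index <- read_html('
<ul>
  <li><a class="movie" href="/title/example-a/">Movie A</a>
      <span class="gross">$1.0M</span></li>
  <li><a class="movie" href="/title/example-b/">Movie B</a>
      <span class="gross">$2.0M</span></li>
</ul>')

links <- html_nodes(index, "a.movie")

movies <- data.frame(
  title = html_text(links),
  gross = html_text(html_nodes(index, "span.gross")),
  # Resolve each relative href against the base URL
  url   = xml2::url_absolute(html_attr(links, "href"), base_url),
  stringsAsFactors = FALSE
)

# Step 2: one function applied to every subpage, since the pages share a structure
get_poster <- function(page) {
  html_attr(html_nodes(page, "img.poster"), "src")[1]
}

# Hypothetical stand-in for one movie subpage
detail <- read_html(
  '<div><img class="poster" src="https://example.com/poster-a.jpg"></div>'
)
get_poster(detail)
```

Against the live site, step 2 would read each page from movies$url, for example with vapply(movies$url, function(u) get_poster(read_html(u)), character(1)).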