Most of the data on the web is still available mainly as HTML. While HTML is structured (hierarchical / tree based), it is often not in a form that is useful for analysis (flat / tidy).
```html
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
```
`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.
Core functions:
* `read_html` - read HTML data from a url or character string.
* `html_nodes` - select specified nodes from the HTML document using CSS selectors.
* `html_table` - parse an HTML table into a data frame.
* `html_text` - extract tag pairs' content.
* `html_name` - extract tags' names.
* `html_attrs` - extract all of each tag's attributes.
* `html_attr` - extract tags' attribute value by name.
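As a minimal sketch of how these functions fit together, the snippet below parses a small inline document and applies each function to it. The document is the example page shown earlier with a made-up table added so that `html_table` has something to parse; the table contents and the variable names `page`, `p`, and `tbl` are assumptions for illustration only.

```r
library(rvest)

# Example page from earlier, plus a hypothetical table (not in the original)
page <- read_html('<html>
<head><title>This is a title</title></head>
<body>
<p align="center">Hello world!</p>
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
</body>
</html>')

p <- html_nodes(page, "p")   # select every <p> node with a CSS selector

html_name(p)                 # tag name: "p"
html_text(p)                 # content between the tag pair: "Hello world!"
html_attrs(p)                # all attributes of each node, as named vectors
html_attr(p, "align")        # one attribute looked up by name: "center"

# html_table() on a set of nodes returns one data frame per table
tbl <- html_table(html_nodes(page, "table"))[[1]]
tbl
```

Because `read_html` accepts a character string as well as a url, the same calls can be tried out on small snippets like this before pointing them at a live page.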
We will be using a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector, which can then be used to subset the HTML document.
Selector | Example | Description |
---|---|---|
element | p | Select all `<p>` elements |
element element | div p | Select all `<p>` elements inside a `<div>` element |
element>element | div > p | Select all `<p>` elements with `<div>` as a parent |
.class | .title | Select all elements with class="title" |
#id | #name | Select all elements with id="name" |
[attribute] | [class] | Select all elements with a class attribute |
[attribute=value] | [class=title] | Select all elements with class="title" |
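To see how these selectors behave in practice, here is a small sketch that runs each one through `html_nodes` and counts the matches. The HTML fragment and the helper `count` are hypothetical and exist only for this illustration.

```r
library(rvest)

# Hypothetical fragment: one <p> directly inside the div, one nested deeper,
# and one outside the div altogether
page <- read_html('<div id="name">
  <p class="title">Direct child of the div</p>
  <blockquote><p>Nested one level deeper</p></blockquote>
</div>
<p>Outside the div</p>')

# Helper (illustrative only): number of nodes matched by a CSS selector
count <- function(css) length(html_nodes(page, css))

count("p")              # 3: every <p> in the document
count("div p")          # 2: <p> elements anywhere inside the <div>
count("div > p")        # 1: only the <p> whose parent is the <div>
count(".title")         # 1: elements with class="title"
count("#name")          # 1: the element with id="name"
count("[class]")        # 1: elements that have a class attribute
count("[class=title]")  # 1: elements whose class attribute equals "title"
```

The difference between `div p` and `div > p` is the one that most often matters when scraping: the first matches at any depth, the second only direct children.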
… because `rmarkdown` hates `rvest` for whatever reason …
For the first five movies in the Now Playing (Box Office) list on imdb.com, create a data frame with the movies' titles, their weekend gross, and the URL of their poster.
This will involve multiple steps:
1. Using the main IMDb page, find the title, the gross, and the movie-specific URL within IMDb.
2. Examine each movie subpage to find the poster URL, the MPAA rating, the movie run time, the user rating, and the metascore. Hint: the same approach should work for all five pages, since IMDb's movie pages all have the same structure.
The only hard-coded URL you should be using is imdb.com.
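The two-step pattern (collect links from a listing page, then visit each one) can be sketched on made-up markup without giving away the exercise. Everything about the HTML below is hypothetical, including the class names `movie` and `gross` and the paths under `/title/`. The real IMDb pages will need selectors found with SelectorGadget, and the `https://www.imdb.com/` form of the base URL is also an assumption.

```r
library(rvest)

# Hypothetical stand-in for a listing page; real IMDb markup will differ
listing <- read_html('<ul>
  <li><a class="movie" href="/title/example-1/">First Movie</a>
      <span class="gross">$10.5M</span></li>
  <li><a class="movie" href="/title/example-2/">Second Movie</a>
      <span class="gross">$7.2M</span></li>
</ul>')

links <- html_nodes(listing, "a.movie")

movies <- data.frame(
  title = html_text(links),
  gross = html_text(html_nodes(listing, ".gross")),
  # Resolve each relative href against the single hard-coded base URL
  url   = xml2::url_absolute(html_attr(links, "href"), "https://www.imdb.com/"),
  stringsAsFactors = FALSE
)

movies
```

Each entry of `movies$url` could then be passed to `read_html`, for example inside `lapply`, and queried with one shared set of selectors, since the subpages are said to have the same structure.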