Most of the data on the web is still available only as HTML. While HTML is structured (hierarchical / tree based), it is often not in a form that is useful for analysis (flat / tidy).
<html> <head> <title>This is a title</title> </head> <body> <p align="center">Hello world!</p> </body> </html>
rvest is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.
Core functions:
- read_html: read HTML data from a URL or character string.
- html_nodes: select specified nodes from the HTML document using CSS selectors.
- html_table: parse an HTML table into a data frame.
- html_text: extract the text content between tag pairs.
- html_name: extract tags' names.
- html_attrs: extract all of each tag's attributes.
- html_attr: extract a tag's attribute value by name.
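As a minimal sketch of how these functions fit together, the snippet below parses a small HTML string (the example document from above, with a made-up table added purely for illustration) and applies each core function to it. It assumes rvest is installed; no network access is needed.

```r
library(rvest)

# Parse HTML from a character string (read_html also accepts a URL).
# The <table> is a made-up addition so that html_table has something to parse.
doc <- read_html('
<html>
  <head> <title>This is a title</title> </head>
  <body>
    <p align="center">Hello world!</p>
    <table>
      <tr> <th>name</th> <th>value</th> </tr>
      <tr> <td>a</td> <td>1</td> </tr>
      <tr> <td>b</td> <td>2</td> </tr>
    </table>
  </body>
</html>')

# Select every <p> node using a CSS selector.
p <- html_nodes(doc, "p")

p_name  <- html_name(p)           # tag name of each selected node
p_text  <- html_text(p)           # text between <p> and </p>
p_attrs <- html_attrs(p)          # list: one named character vector per node
p_align <- html_attr(p, "align")  # a single attribute, looked up by name

# html_table on a set of nodes returns a list of data frames, one per table.
tbl <- html_table(html_nodes(doc, "table"))[[1]]
```

Most of these functions are vectorized over the selected nodes, so the same calls work unchanged when a selector matches many elements.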
We will be using a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector, which can then be used to subset the HTML document.
Selector | Example | Description |
---|---|---|
element | p | Select all <p> elements |
element element | div p | Select all <p> elements inside a <div> element |
element>element | div > p | Select all <p> elements with <div> as a parent |
.class | .title | Select all elements with class="title" |
#id | #name | Select all elements with id="name" |
[attribute] | [class] | Select all elements with a class attribute |
[attribute=value] | [class=title] | Select all elements with class="title" |
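To see how the selectors in the table behave, here is a small sketch that runs each one against a made-up HTML fragment. The markup and the helper `count_matches` are illustrative assumptions, not part of rvest.

```r
library(rvest)

# Made-up fragment: one <p> directly inside a <div>, one nested more deeply,
# and one outside of any <div>.
doc <- read_html('
<html><body>
  <div>
    <p class="title">Direct child of div</p>
    <blockquote><p id="name">Nested deeper inside div</p></blockquote>
  </div>
  <p>Outside any div</p>
</body></html>')

# Hypothetical helper: number of nodes matched by a CSS selector.
count_matches <- function(selector) length(html_nodes(doc, selector))

n_p        <- count_matches("p")              # every <p>
n_div_p    <- count_matches("div p")          # <p> anywhere inside a <div>
n_child_p  <- count_matches("div > p")        # <p> whose parent is a <div>
n_class    <- count_matches(".title")         # class="title"
n_id       <- count_matches("#name")          # id="name"
n_has_attr <- count_matches("[class]")        # any element with a class attribute
n_attr_val <- count_matches("[class=title]")  # class attribute equal to "title"

# Text of the only <p> that is a direct child of the <div>.
child_text <- html_text(html_nodes(doc, "div > p"))
```

Comparing `n_div_p` with `n_child_p` shows the difference between the descendant selector and the child selector: the nested paragraph matches the first but not the second.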
… because rmarkdown hates rvest for whatever reason …
For the first five movies in the Now Playing (Box Office) list on imdb.com, create a data frame with the movies' titles, their weekend gross, and the URL of their poster.
This will involve multiple steps:

1. Using the main IMDb page, find the title, gross, and the movie-specific URL within IMDb.
2. Examine each movie subpage to find the poster URLs. Hint: the same approach should work for all five pages, since IMDb's movie pages all have the same structure.

The only hard-coded URL you should be using is imdb.com.
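The two-step pattern can be sketched offline as follows. Everything about the markup here is an assumption: the class names, the paths, and the stand-in pages are invented for illustration, so the selectors for the real IMDb pages still need to be found with SelectorGadget. The base URL `https://www.imdb.com/` is likewise assumed.

```r
library(rvest)

# Made-up stand-in for the main page; real IMDb markup will differ.
main_page <- read_html('
<html><body>
  <div class="movie"><a href="/movie/a">Movie A</a> <span class="gross">$10M</span></div>
  <div class="movie"><a href="/movie/b">Movie B</a> <span class="gross">$5M</span></div>
</body></html>')

# Made-up stand-ins for the movie subpages, keyed by relative URL.
sub_pages <- list(
  "/movie/a" = read_html('<html><body><img class="poster" src="a.jpg"></body></html>'),
  "/movie/b" = read_html('<html><body><img class="poster" src="b.jpg"></body></html>')
)

# Step 1: title, gross, and movie-specific URL from the main page.
links <- html_nodes(main_page, ".movie a")
movies <- data.frame(
  title = html_text(links),
  gross = html_text(html_nodes(main_page, ".movie .gross")),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Relative links can be resolved against the single hard-coded base URL.
full_url <- xml2::url_absolute(movies$url, "https://www.imdb.com/")

# Step 2: the same extraction function is applied to every subpage.
get_poster <- function(page) html_attr(html_nodes(page, "img.poster"), "src")[1]

# With live pages, sub_pages[[u]] would be replaced by read_html() on the full URL.
movies$poster <- vapply(
  movies$url,
  function(u) get_poster(sub_pages[[u]]),
  character(1),
  USE.NAMES = FALSE
)
```

For the actual exercise, `head(movies, 5)` (or subsetting the selected nodes) would restrict the result to the first five movies.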