Screen scraping
Web APIs
Due Tuesday: HW 4
Take a look at the source code and look for the <table> tag:
http://www.imdb.com/chart/top
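Once a page's data sits in a <table> tag, rvest can often turn it into a data frame directly. The sketch below assumes rvest is installed and uses a small made-up HTML fragment (not the IMDb page), so it runs offline; the name demo_page and the movie rows are hypothetical.

```r
library(rvest)

# Hypothetical two-row table, standing in for a real page's <table> markup
demo_page <- read_html('
  <table>
    <tr><th>title</th><th>year</th></tr>
    <tr><td>Movie A</td><td>1994</td></tr>
    <tr><td>Movie B</td><td>1972</td></tr>
  </table>')

# html_table() returns one data frame per <table> node that was selected
tables <- demo_page %>%
  html_nodes("table") %>%
  html_table()

movies <- tables[[1]]
print(movies)
```

Real pages rarely have tables this clean, which is why the slides that follow pick out individual elements with CSS selectors instead.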
rvest: Easily harvest (scrape) web pages
stringr: Make it easier to work with strings
dplyr: A Grammar of Data Manipulation
library(rvest)
library(stringr)
library(dplyr)
SelectorGadget: open source tool that makes CSS selector generation and discovery on complicated sites a breeze
Install the Chrome Extension
Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the perfect CSS selector for your needs.
vignette("selectorgadget")
page <- read_html("http://www.imdb.com/chart/top")
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") # remove )
scores <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  tail(-1) # remove first entry that is not a score
imdb_top_250 <- data.frame(title = titles, year = years, score = scores)
head(imdb_top_250)
##                      title year score
## 1 The Shawshank Redemption 1994   9.2
## 2            The Godfather 1972   9.2
## 3   The Godfather: Part II 1974   9.0
## 4          The Dark Knight 2008   8.9
## 5             12 Angry Men 1957   8.9
## 6         Schindler's List 1993   8.9
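The select-then-extract pattern can be tried offline before pointing it at a live site. This is a minimal sketch on a made-up HTML fragment that only mimics the class names used above; the fragment, the demo_* names, and the movie entries are hypothetical, and str_replace_all() with a character class is swapped in to strip both parentheses in one step.

```r
library(rvest)
library(stringr)

# Hypothetical fragment reusing the .titleColumn and .secondaryInfo classes
demo <- read_html('
  <div class="titleColumn"><a href="#">Movie A</a> <span class="secondaryInfo">(1994)</span></div>
  <div class="titleColumn"><a href="#">Movie B</a> <span class="secondaryInfo">(1972)</span></div>')

# Text of every link nested inside an element with class "titleColumn"
demo_titles <- demo %>%
  html_nodes(".titleColumn a") %>%
  html_text()

# Text of every element with class "secondaryInfo", cleaned and made numeric
demo_years <- demo %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace_all("[()]", "") %>% # remove ( and )
  as.numeric()

demo_movies <- data.frame(title = demo_titles, year = demo_years)
print(demo_movies)
```

Converting the year while scraping is a design choice: it avoids a separate clean-up step later, at the cost of a warning and an NA if a page ever returns a non-numeric year.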
Cleaning up may or may not be a lot of work, depending on how messy the data are
imdb_top_250$score = as.numeric(as.character(imdb_top_250$score))
imdb_top_250$rank = as.numeric(row.names(imdb_top_250))
head(imdb_top_250)
##                      title year score rank
## 1 The Shawshank Redemption 1994   9.2    1
## 2            The Godfather 1972   9.2    2
## 3   The Godfather: Part II 1974   9.0    3
## 4          The Dark Knight 2008   8.9    4
## 5             12 Angry Men 1957   8.9    5
## 6         Schindler's List 1993   8.9    6
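The as.character() step matters when a column is stored as a factor, which data.frame() did by default for text before R 4.0. A small sketch with made-up scores (score_factor is a hypothetical name) shows what goes wrong without it:

```r
# Scores as they might arrive when a data frame stores text as a factor
score_factor <- factor(c("9.2", "8.9", "9.0"))

# Converting the factor directly yields its internal level codes
codes <- as.numeric(score_factor)

# Going through character first recovers the numbers that were scraped
values <- as.numeric(as.character(score_factor))

print(codes)  # 3 1 2
print(values) # 9.2 8.9 9.0
```

If the column is already a character vector, as.character() is a no-op, so the two-step conversion is safe either way.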
See which years have the most movies on the list:
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
## Source: local data frame [5 x 2]
##
##     year total
##   (fctr) (int)
## 1   1995     9
## 2   1957     7
## 3   2001     7
## 4   2003     7
## 5   2014     7
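Counting rows per group is common enough that dplyr offers count() as a shorthand for the group_by(), summarise(), arrange() chain. The sketch below uses made-up years (toy is a hypothetical data frame, not the IMDb data) to suggest the two forms give the same totals:

```r
library(dplyr)

# Made-up years, for illustration only
toy <- data.frame(year = c(1995, 1995, 1957, 2001, 1995, 1957))

# Long form, as in the pipeline above
by_group <- toy %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

# Shorthand: count rows per year, largest group first
by_count <- toy %>%
  count(year, sort = TRUE, name = "total")

print(by_count)
```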
See the 1995 movies
imdb_top_250 %>%
  filter(year == 1995)
##                title year score rank
## 1              Se7en 1995   8.6   22
## 2 The Usual Suspects 1995   8.6   24
## 3         Braveheart 1995   8.3   77
## 4          Toy Story 1995   8.3   96
## 5               Heat 1995   8.2  123
## 6             Casino 1995   8.2  139
## 7     Before Sunrise 1995   8.0  205
## 8     Twelve Monkeys 1995   8.0  206
## 9           La Haine 1995   8.0  223
Plot yearly average scores
library(ggplot2) # needed for ggplot(); not loaded with the packages above
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = as.numeric(as.character(year)))) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("year")
Discussion: https://raleigh.craigslist.org/search/apa
Often requires creating an account to get an API key.
R packages like jsonlite will be useful for parsing the JSON data you retrieve.
We won’t focus on this in this class, but I’m happy to point you to resources.
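As a starting point, here is a minimal sketch of what jsonlite does with JSON once it has been retrieved. It assumes jsonlite is installed and parses a made-up JSON string rather than calling a real API, so no account or key is involved; json_text and its contents are hypothetical.

```r
library(jsonlite)

# Made-up JSON, shaped like a typical API response body
json_text <- '[{"title": "Movie A", "year": 1994},
               {"title": "Movie B", "year": 1972}]'

# An array of records with the same fields is simplified to a data frame
api_movies <- fromJSON(json_text)
str(api_movies)
```

With a real API, the string would come from an HTTP request instead, and the request would usually have to carry the API key.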