Scraping the web

Scraping the web: what? why?

  • An increasing amount of data is available on the web.
  • These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors.
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
  • Two different scenarios:
    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
  • Why R? It includes all the tools necessary for web scraping, it’s familiar, and it allows direct analysis of the data… but Python, Perl, and Java are probably more efficient tools.

Screen scraping

Top 250 movies on IMDB

Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
[Figure: imdb_top — screenshot of the IMDB Top 250 table]

Useful R libraries

  • XML: Tools for parsing and generating XML within R

  • stringr: Make it easier to work with strings

  • lubridate: Make dealing with dates a little easier

  • dplyr: A Grammar of Data Manipulation

library(XML)
library(stringr)
library(lubridate)
library(dplyr)

Read the web page source code - easy!

url = "http://www.imdb.com/chart/top"
imdb250 = readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
imdb250 = imdb250[,2:3]
head(imdb250, n = 1)
##                                   Rank & Title IMDb Rating
## 1 1.\n    The Shawshank Redemption\n    (1994)         9.2

Clean up - not as easy…

… depending on how messy the data are

Fix columns

Fix the column names so they don’t have spaces

names(imdb250) = c("title", "rating")

Make rating numeric

imdb250$rating = as.numeric(as.character(imdb250$rating))

Add a column for rank

imdb250$rank = as.numeric(row.names(imdb250))
head(imdb250, n = 1)
##                                          title rating rank
## 1 1.\n    The Shawshank Redemption\n    (1994)    9.2    1

Parsing text

Extract title using str_extract from stringr

title = str_extract(imdb250$title, pattern = "\n.*\n")
title = gsub("\n","",title) # find and replace
title = str_trim(title) # remove leading and trailing spaces
title[1:5]
## [1] "The Shawshank Redemption" "The Godfather"           
## [3] "The Godfather: Part II"   "The Dark Knight"         
## [5] "Pulp Fiction"

More text parsing

year <- str_extract(imdb250$title, pattern = "\\([0-9]{4}\\)")
year <- gsub("\\(|\\)","", year)
year[1:5]
## [1] "1994" "1972" "1974" "2008" "1994"
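The title and year extractions above can also be done in one pass with a single regular expression and capture groups via str_match. A minimal sketch on a sample string mimicking one cell of the raw “Rank & Title” column:

```r
library(stringr)

# One-pass alternative: capture the title and the year with one regex.
# "cell" mimics one raw cell scraped from the IMDB chart.
cell = "1.\n    The Shawshank Redemption\n    (1994)"
m = str_match(cell, "\\n\\s*(.+)\\n\\s*\\(([0-9]{4})\\)")
title = str_trim(m[, 2]) # second column = first capture group
year = as.numeric(m[, 3]) # third column = second capture group
```

Note that `.` does not match newlines by default, which is what keeps the first capture group from running past the title.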

Almost there…

imdb250$title = title
imdb250$year = as.numeric(year)
head(imdb250)
##                            title rating rank year
## 1       The Shawshank Redemption    9.2    1 1994
## 2                  The Godfather    9.2    2 1972
## 3         The Godfather: Part II    9.0    3 1974
## 4                The Dark Knight    8.9    4 2008
## 5                   Pulp Fiction    8.9    5 1994
## 6 The Good, the Bad and the Ugly    8.9    6 1966

Analyze

See which years have the most movies on the list:

imdb250 %>% 
  group_by(year) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(5)
## Source: local data frame [5 x 2]
## 
##   year total
## 1 1995    10
## 2 2001     8
## 3 1998     7
## 4 2004     7
## 5 1957     6

Analyze

See the 1995 movies

imdb250 %>% 
  filter(year==1995)
##                 title rating rank year
## 1               Se7en    8.6   23 1995
## 2  The Usual Suspects    8.6   24 1995
## 3          Braveheart    8.4   81 1995
## 4           Toy Story    8.3  110 1995
## 5                Heat    8.2  125 1995
## 6              Casino    8.2  142 1995
## 7      Twelve Monkeys    8.1  206 1995
## 8      Before Sunrise    8.0  218 1995
## 9            La Haine    8.0  223 1995
## 10        Underground    8.0  244 1995

Web APIs

The rules of the game

  • Respect the hosting site’s wishes:
    • Check if an API exists first, or if data are available for download.
    • Some websites “disallow” scrapers on their robots.txt files.
  • Limit your bandwidth use:
    • Wait one or more seconds after each hit
    • Try to scrape websites during off-peak hours
  • Scrape only what you need, and just once
    • When using APIs, read terms and conditions.
    • The fact that you can access some data doesn’t mean you should use it for your research. Be aware of rate limits.
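To make the robots.txt point concrete, here is a hedged, simplified sketch of checking whether a path is disallowed, given the lines of a robots.txt file. Real robots.txt parsing has more rules (per-user-agent groups, Allow directives, wildcards), so treat this as an illustration only:

```r
# Simplified check: is "path" covered by a Disallow rule?
# (Ignores user-agent groups, Allow rules, and wildcards.)
is_disallowed = function(robots_lines, path) {
  rules = sub("^Disallow:\\s*", "",
              grep("^Disallow:", robots_lines, value = TRUE))
  rules = rules[nzchar(rules)] # an empty Disallow means "allow everything"
  any(startsWith(path, rules))
}

robots = c("User-agent: *", "Disallow: /private/", "Disallow: /tmp/")
is_disallowed(robots, "/private/data.html") # TRUE
is_disallowed(robots, "/chart/top")         # FALSE
```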

Rotten Tomatoes

Movies in theaters

  • Download in_theaters.json:

http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=[your_api_key]&page_limit=1

  • View the pretty version of the file

http://jsonformatter.curiousconcept.com/

  • Install and load the rjson package:

library(rjson)

and read the JSON file

in_theaters = fromJSON(file = "in_theaters.json")
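If you don’t have the file handy, fromJSON also accepts a JSON string directly, which is a quick way to see the structure it returns: nested R lists. A self-contained toy example (the JSON below is made up, not real API output):

```r
library(rjson)

# fromJSON on an inline string returns nested R lists,
# the same shape you get when reading the downloaded file.
x = fromJSON('{"total": 2, "movies": [{"title": "A"}, {"title": "B"}]}')
x$total             # 2
x$movies[[2]]$title # "B"
```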

Movies in theaters (cont.)

Investigate the file format to figure out how best to parse it

class(in_theaters); names(in_theaters)
## [1] "list"
## [1] "total"         "movies"        "links"         "link_template"
class(in_theaters$movies) #str(in_theaters$movies)
## [1] "list"
in_theaters$movies[[1]]$title
## [1] "Big Hero 6"

Movies in theaters (cont.)

Convert the file into a data frame with 5 columns:

title, mpaa_rating, runtime, critics_score, audience_score

cells = rep(NA, length(in_theaters$movies)) # one row per movie, rather than a hardcoded count
d = data.frame(title = cells, mpaa_rating = cells, runtime = cells, 
               critics_score = cells, audience_score = cells)
for(i in 1:length(in_theaters$movies)){
  d$title[i] = in_theaters$movies[[i]]$title
  d$mpaa_rating[i] = in_theaters$movies[[i]]$mpaa_rating
  d$runtime[i] = in_theaters$movies[[i]]$runtime
  d$critics_score[i] = in_theaters$movies[[i]]$ratings$critics_score
  d$audience_score[i] = in_theaters$movies[[i]]$ratings$audience_score
}
head(d)
##                title mpaa_rating runtime critics_score audience_score
## 1         Big Hero 6          PG      93            89             94
## 2       Interstellar       PG-13     169            73             87
## 3 Dumb and Dumber To       PG-13     109            26             59
## 4              Ouija       PG-13      83             7             30
## 5               Fury           R     135            78             88
## 6        St. Vincent       PG-13     103            74             84

Movies in theaters (cont.)

Using the sapply function:

title = sapply(in_theaters$movies, function(x) x$title)
mpaa_rating = sapply(in_theaters$movies, function(x) x$mpaa_rating)
runtime = sapply(in_theaters$movies, function(x) x$runtime)
critics_score = sapply(in_theaters$movies, function(x) x$ratings$critics_score)
audience_score = sapply(in_theaters$movies, function(x) x$ratings$audience_score)
d2 = data.frame(title, mpaa_rating, runtime, critics_score, audience_score,
                stringsAsFactors = FALSE) # avoid cbind() here: it coerces every column to character
head(d2)
##                title mpaa_rating runtime critics_score audience_score
## 1         Big Hero 6          PG      93            89             94
## 2       Interstellar       PG-13     169            73             87
## 3 Dumb and Dumber To       PG-13     109            26             59
## 4              Ouija       PG-13      83             7             30
## 5               Fury           R     135            78             88
## 6        St. Vincent       PG-13     103            74             84

“All” movies

  • Someone on IMDB has created a list of “all” movies: http://www.imdb.com/list/ls057823854/

  • If you are logged in, you can export this list in CSV format, call it all.csv, and create an id column:

all = read.csv("all.csv", stringsAsFactors = FALSE)
all$id = str_replace(all$const, "^tt", "")

  • These ids correspond to IDs on the Rotten Tomatoes API, so you could easily download JSON files for each movie (though you might actually want to do this for just a sample)

http://developer.rottentomatoes.com/docs/json/v10/Movie_Info

“All” movies (cont.)

Sample code for downloading JSON files for all movies:

key = "[your rotten tomatoes API key]"
for(i in 1:nrow(all)){
  url1 = "http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id="
  id = all$id[i]
  url2 = paste0("&type=imdb&apikey=", key)
  url = paste0(url1, id, url2)
  filename = paste0(i,"_",id,".json")
  download.file(url = url, destfile = filename)
  Sys.sleep(1) # pause one second between requests
}

Then, using loops, you can read the information in from each file and store it in a dataset.
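That last step might look like the sketch below: collect selected fields from a set of downloaded JSON files into one data frame. The field names used here (title, year, ratings$critics_score) follow the Movie_Info response format linked above; adjust them to whatever fields you actually need.

```r
library(rjson)

# Read each downloaded JSON file and keep a few fields per movie.
read_movie_files = function(files) {
  rows = lapply(files, function(f) {
    m = fromJSON(file = f)
    data.frame(title = m$title,
               year = m$year,
               critics_score = m$ratings$critics_score,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows) # stack the one-row data frames
}

# e.g.: movies = read_movie_files(list.files(pattern = "\\.json$"))
```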