Scraping the web

Scraping the web: what? why?

  • An increasing amount of data is available on the web.
  • These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors.
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
  • Two different scenarios:
    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
  • Why R? It includes all the tools necessary for web scraping, it’s familiar, and it allows direct analysis of the data… but Python, Perl, and Java are probably more efficient tools.

Screen scraping

Top 250 movies on IMDB

Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
[Figure: imdb_top — screenshot of the IMDB Top 250 table]

Useful R libraries

  • XML: Tools for parsing and generating XML within R

  • stringr: Make it easier to work with strings

  • lubridate: Make dealing with dates a little easier

  • dplyr: A Grammar of Data Manipulation

library(XML)
library(stringr)
library(lubridate)
library(dplyr)

Read the web page source code - easy!

url = "http://www.imdb.com/chart/top"
imdb250 = readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
imdb250 = imdb250[,2:3]
head(imdb250, n = 1)
##                                   Rank & Title IMDb Rating
## 1 1.\n    The Shawshank Redemption\n    (1994)         9.2

Clean up - not as easy…

… depending on how messy the data are

Fix columns

Fix the column names so they don’t have spaces

names(imdb250) = c("title", "rating")

Make rating numeric

imdb250$rating = as.numeric(as.character(imdb250$rating))

Add a column for rank

imdb250$rank = as.numeric(row.names(imdb250))
head(imdb250, n = 1)
##                                          title rating rank
## 1 1.\n    The Shawshank Redemption\n    (1994)    9.2    1

Parsing text

Extract title using str_extract from stringr

title = str_extract(imdb250$title, pattern = "\n.*\n")
title = gsub("\n","",title) # find and replace
title = str_trim(title) # remove leading and trailing spaces
title[1:5]
## [1] "The Shawshank Redemption" "The Godfather"           
## [3] "The Godfather: Part II"   "The Dark Knight"         
## [5] "Pulp Fiction"

More text parsing

year <- str_extract(imdb250$title, pattern = "\\([0-9]{4}\\)")
year <- gsub("\\(|\\)","", year)
year[1:5]
## [1] "1994" "1972" "1974" "2008" "1994"
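The title and year extractions above can also be done in one pass with a single regular expression and capture groups via str_match. A minimal sketch on a sample string mimicking one cell of the raw “Rank & Title” column:

```r
library(stringr)

# One-pass alternative: capture the title and the year with one regex.
# "cell" mimics one raw cell scraped from the IMDB chart.
cell = "1.\n    The Shawshank Redemption\n    (1994)"
m = str_match(cell, "\\n\\s*(.+)\\n\\s*\\(([0-9]{4})\\)")
title = str_trim(m[, 2]) # second column = first capture group
year = as.numeric(m[, 3]) # third column = second capture group
```

Note that `.` does not match newlines by default, which is what keeps the first capture group from running past the title.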

Almost there…

imdb250$title = title
imdb250$year = as.numeric(year)
head(imdb250)
##                            title rating rank year
## 1       The Shawshank Redemption    9.2    1 1994
## 2                  The Godfather    9.2    2 1972
## 3         The Godfather: Part II    9.0    3 1974
## 4                The Dark Knight    8.9    4 2008
## 5                   Pulp Fiction    8.9    5 1994
## 6 The Good, the Bad and the Ugly    8.9    6 1966

Analyze

See which years have the most movies on the list:

imdb250 %>% 
  group_by(year) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(5)
## Source: local data frame [5 x 2]
## 
##   year total
## 1 1995    10
## 2 2001     8
## 3 1998     7
## 4 2004     7
## 5 1957     6

Analyze

See the 1995 movies

imdb250 %>% 
  filter(year==1995)
##                 title rating rank year
## 1               Se7en    8.6   23 1995
## 2  The Usual Suspects    8.6   24 1995
## 3          Braveheart    8.4   81 1995
## 4           Toy Story    8.3  110 1995
## 5                Heat    8.2  125 1995
## 6              Casino    8.2  142 1995
## 7      Twelve Monkeys    8.1  206 1995
## 8      Before Sunrise    8.0  218 1995
## 9            La Haine    8.0  223 1995
## 10        Underground    8.0  244 1995

Web APIs

The rules of the game

  • Respect the hosting site’s wishes:
    • Check if an API exists first, or if data are available for download.
    • Some websites “disallow” scrapers on their robots.txt files.
  • Limit your bandwidth use:
    • Wait one or more seconds after each hit
    • Try to scrape websites during off-peak hours
  • Scrape only what you need, and just once
    • When using APIs, read terms and conditions.
    • The fact that you can access some data doesn’t mean you should use it for your research. Be aware of rate limits.
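To make the robots.txt point concrete, here is a hedged, simplified sketch of checking whether a path is disallowed, given the lines of a robots.txt file. Real robots.txt parsing has more rules (per-user-agent groups, Allow directives, wildcards), so treat this as an illustration only:

```r
# Simplified check: is "path" covered by a Disallow rule?
# (Ignores user-agent groups, Allow rules, and wildcards.)
is_disallowed = function(robots_lines, path) {
  rules = sub("^Disallow:\\s*", "",
              grep("^Disallow:", robots_lines, value = TRUE))
  rules = rules[nzchar(rules)] # an empty Disallow means "allow everything"
  any(startsWith(path, rules))
}

robots = c("User-agent: *", "Disallow: /private/", "Disallow: /tmp/")
is_disallowed(robots, "/private/data.html") # TRUE
is_disallowed(robots, "/chart/top")         # FALSE
```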

Rotten Tomatoes

Movies in theaters

  • Download in_theaters.json:

http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=[your_api_key]&page_limit=1

  • View the pretty version of the file

http://jsonformatter.curiousconcept.com/

  • Install and load the rjson package:

library(rjson)

and read the JSON file

in_theaters = fromJSON(file = "in_theaters.json")
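If you don’t have the file handy, fromJSON also accepts a JSON string directly, which is a quick way to see the structure it returns: nested R lists. A self-contained toy example (the JSON below is made up, not real API output):

```r
library(rjson)

# fromJSON on an inline string returns nested R lists,
# the same shape you get when reading the downloaded file.
x = fromJSON('{"total": 2, "movies": [{"title": "A"}, {"title": "B"}]}')
x$total             # 2
x$movies[[2]]$title # "B"
```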

Movies in theaters (cont.)

Investigate the file format to figure out how best to parse it

class(in_theaters); names(in_theaters)
## [1] "list"
## [1] "total"         "movies"        "links"         "link_template"
class(in_theaters$movies) #str(in_theaters$movies)
## [1] "list"
in_theaters$movies[[1]]$title
## [1] "Big Hero 6"

Movies in theaters (cont.)

Convert the file into a data frame with 5 columns:

title, mpaa_rating, runtime, critics_score, audience_score

cells = rep(NA, length(in_theaters$movies)) # one row per movie, rather than a hardcoded count
d = data.frame(title = cells, mpaa_rating = cells, runtime = cells, 
               critics_score = cells, audience_score = cells)
for(i in 1:length(in_theaters$movies)){
  d$title[i] = in_theaters$movies[[i]]$title
  d$mpaa_rating[i] = in_theaters$movies[[i]]$mpaa_rating
  d$runtime[i] = in_theaters$movies[[i]]$runtime
  d$critics_score[i] = in_theaters$movies[[i]]$ratings$critics_score
  d$audience_score[i] = in_theaters$movies[[i]]$ratings$audience_score
}
head(d)
##                title mpaa_rating runtime critics_score audience_score
## 1         Big Hero 6          PG      93            89             94
## 2       Interstellar       PG-13     169            73             87
## 3 Dumb and Dumber To       PG-13     109            26             59
## 4              Ouija       PG-13      83             7             30
## 5               Fury           R     135            78             88
## 6        St. Vincent       PG-13     103            74             84

Movies in theaters (cont.)

Using the sapply function:

title = sapply(in_theaters$movies, function(x) x$title)
mpaa_rating = sapply(in_theaters$movies, function(x) x$mpaa_rating)
runtime = sapply(in_theaters$movies, function(x) x$runtime)
critics_score = sapply(in_theaters$movies, function(x) x$ratings$critics_score)
audience_score = sapply(in_theaters$movies, function(x) x$ratings$audience_score)
d2 = data.frame(title, mpaa_rating, runtime, critics_score, audience_score,
                stringsAsFactors = FALSE) # avoid cbind() here: it coerces every column to character
head(d2)
##                title mpaa_rating runtime critics_score audience_score
## 1         Big Hero 6          PG      93            89             94
## 2       Interstellar       PG-13     169            73             87
## 3 Dumb and Dumber To       PG-13     109            26             59
## 4              Ouija       PG-13      83             7             30
## 5               Fury           R     135            78             88
## 6        St. Vincent       PG-13     103            74             84

“All” movies

  • Someone on IMDB has created a list of “all” movies: http://www.imdb.com/list/ls057823854/

  • If you are logged in, you can export this list in CSV format, call it all.csv, and create an id column:

all = read.csv("all.csv", stringsAsFactors = FALSE)
all$id = str_replace(all$const, "^tt", "")

  • These ids correspond to IDs on the Rotten Tomatoes API, so you could easily download JSON files for each movie (though you might actually want to do this for just a sample)

http://developer.rottentomatoes.com/docs/json/v10/Movie_Info

“All” movies (cont.)

Sample code for downloading JSON files for all movies:

key = "[your rotten tomatoes API key]"
for(i in 1:nrow(all)){
  url1 = "http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id="
  id = all$id[i]
  url2 = paste0("&type=imdb&apikey=", key)
  url = paste0(url1, id, url2)
  filename = paste0(i,"_",id,".json")
  download.file(url = url, destfile = filename)
  Sys.sleep(1) # pause one second between requests
}

Then, using loops, you can read the information in from each file and store it in a dataset.
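That last step might look like the sketch below: collect selected fields from a set of downloaded JSON files into one data frame. The field names used here (title, year, ratings$critics_score) follow the Movie_Info response format linked above; adjust them to whatever fields you actually need.

```r
library(rjson)

# Read each downloaded JSON file and keep a few fields per movie.
read_movie_files = function(files) {
  rows = lapply(files, function(f) {
    m = fromJSON(file = f)
    data.frame(title = m$title,
               year = m$year,
               critics_score = m$ratings$critics_score,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows) # stack the one-row data frames
}

# e.g.: movies = read_movie_files(list.files(pattern = "\\.json$"))
```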