XML: Tools for parsing and generating XML within R
stringr: Make it easier to work with strings
lubridate: Make dealing with dates a little easier
dplyr: A Grammar of Data Manipulation
library(XML)
library(stringr)
library(lubridate)
library(dplyr)
url = "http://www.imdb.com/chart/top"
imdb250 = readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
imdb250 = imdb250[,2:3]
head(imdb250, n = 1)
## Rank & Title IMDb Rating
## 1 1.\n The Shawshank Redemption\n (1994) 9.2
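readHTMLTable can also parse HTML that is already in memory, which is handy for testing the parsing step without touching the network. A minimal sketch, assuming a small hand-written table that stands in for the real page (the html string below is hypothetical, not the actual IMDb markup):

```r
library(XML)

# Hypothetical stand-in for the downloaded page: a table shaped like the one above
html <- paste0(
  "<table>",
  "<thead><tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr></thead>",
  "<tbody>",
  "<tr><td>1. The Shawshank Redemption (1994)</td><td>9.2</td></tr>",
  "<tr><td>2. The Godfather (1972)</td><td>9.2</td></tr>",
  "</tbody></table>"
)

doc <- htmlParse(html, asText = TRUE)   # parse the string instead of a URL
tab <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
dim(tab)                                # rows and columns of the parsed table
```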
… depending on how messy the data are
Fix the column names so they don’t have spaces
names(imdb250) = c("title", "rating")
Make rating numeric
imdb250$rating = as.numeric(as.character(imdb250$rating))
Add a column for rank
imdb250$rank = as.numeric(row.names(imdb250))
head(imdb250, n = 1)
## title rating rank
## 1 1.\n The Shawshank Redemption\n (1994) 9.2 1
Extract title using str_extract from stringr
title = str_extract(imdb250$title, pattern = "\n.*\n")
title = gsub("\n","",title) # find and replace
title = str_trim(title) # remove leading and trailing spaces
title[1:5]
## [1] "The Shawshank Redemption" "The Godfather"
## [3] "The Godfather: Part II" "The Dark Knight"
## [5] "Pulp Fiction"
year <- str_extract(imdb250$title, pattern = "\\([0-9]{4}\\)")
year <- gsub("\\(|\\)","", year)
year[1:5]
## [1] "1994" "1972" "1974" "2008" "1994"
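The same two extractions can be sketched in base R, which may help if stringr is not available. This assumes the raw strings look like the first row printed above; the raw vector is a hand-made stand-in for imdb250$title:

```r
# Hand-made stand-in for the raw "Rank & Title" strings
raw <- c("1.\n    The Shawshank Redemption\n        (1994)",
         "2.\n    The Godfather\n        (1972)")

# Title: keep only the text between the two newlines, then trim whitespace
title <- trimws(sub("^[^\n]*\n([^\n]*)\n[^\n]*$", "\\1", raw))

# Year: pull out "(dddd)", then drop the parentheses
year <- regmatches(raw, regexpr("\\([0-9]{4}\\)", raw))
year <- as.numeric(gsub("[()]", "", year))

title
year
```

One difference to keep in mind: regmatches silently drops elements with no match, whereas str_extract returns NA for them, so the stringr version is safer when some rows might lack a year.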
imdb250$title = title
imdb250$year = as.numeric(year)
head(imdb250)
## title rating rank year
## 1 The Shawshank Redemption 9.2 1 1994
## 2 The Godfather 9.2 2 1972
## 3 The Godfather: Part II 9.0 3 1974
## 4 The Dark Knight 8.9 4 2008
## 5 Pulp Fiction 8.9 5 1994
## 6 The Good, the Bad and the Ugly 8.9 6 1966
See which years have the most movies on the list:
imdb250 %>%
group_by(year) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(5)
## Source: local data frame [5 x 2]
##
## year total
## 1 1995 10
## 2 2001 8
## 3 1998 7
## 4 2004 7
## 5 1957 6
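As a cross-check on the dplyr pipeline, the same count can be sketched in base R with table() and sort(). The year vector here holds only the six years printed by head(imdb250) above, so the counts are for illustration rather than the full list:

```r
# Years of the six movies printed by head(imdb250) above
year <- c(1994, 1972, 1974, 2008, 1994, 1966)

# Count movies per year and put the most frequent years first
counts <- sort(table(year), decreasing = TRUE)
head(counts, 5)
```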
See the 1995 movies
imdb250 %>%
filter(year==1995)
## title rating rank year
## 1 Se7en 8.6 23 1995
## 2 The Usual Suspects 8.6 24 1995
## 3 Braveheart 8.4 81 1995
## 4 Toy Story 8.3 110 1995
## 5 Heat 8.2 125 1995
## 6 Casino 8.2 142 1995
## 7 Twelve Monkeys 8.1 206 1995
## 8 Before Sunrise 8.0 218 1995
## 9 La Haine 8.0 223 1995
## 10 Underground 8.0 244 1995
Register and apply for an API key: http://developer.rottentomatoes.com/apps/
http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=[your_api_key]&page_limit=1
http://jsonformatter.curiousconcept.com/
Load the rjson package
library(rjson)
and read the JSON file
in_theaters = fromJSON(file = "in_theaters.JSON")
Investigate the file format to figure out how best to parse it
class(in_theaters); names(in_theaters)
## [1] "list"
## [1] "total" "movies" "links" "link_template"
class(in_theaters$movies); #str(in_theaters$movies)
## [1] "list"
in_theaters$movies[[1]]$title
## [1] "Big Hero 6"
Convert the file into a data frame with 5 columns:
title, mpaa_rating, runtime, critics_score, audience_score
cells = rep(NA, length(in_theaters$movies)) # one row per movie
d = data.frame(title = cells, mpaa_rating = cells, runtime = cells,
critics_score = cells, audience_score = cells)
for(i in seq_along(in_theaters$movies)){
d$title[i] = in_theaters$movies[[i]]$title
d$mpaa_rating[i] = in_theaters$movies[[i]]$mpaa_rating
d$runtime[i] = in_theaters$movies[[i]]$runtime
d$critics_score[i] = in_theaters$movies[[i]]$ratings$critics_score
d$audience_score[i] = in_theaters$movies[[i]]$ratings$audience_score
}
head(d)
## title mpaa_rating runtime critics_score audience_score
## 1 Big Hero 6 PG 93 89 94
## 2 Interstellar PG-13 169 73 87
## 3 Dumb and Dumber To PG-13 109 26 59
## 4 Ouija PG-13 83 7 30
## 5 Fury R 135 78 88
## 6 St. Vincent PG-13 103 74 84
Using the sapply function:
title = sapply(in_theaters$movies, function(x) x$title)
mpaa_rating = sapply(in_theaters$movies, function(x) x$mpaa_rating)
runtime = sapply(in_theaters$movies, function(x) x$runtime)
critics_score = sapply(in_theaters$movies, function(x) x$ratings$critics_score)
audience_score = sapply(in_theaters$movies, function(x) x$ratings$audience_score)
# build the data frame directly: cbind() would coerce every column to character
d2 = data.frame(title, mpaa_rating, runtime, critics_score, audience_score,
stringsAsFactors = FALSE)
head(d2)
## title mpaa_rating runtime critics_score audience_score
## 1 Big Hero 6 PG 93 89 94
## 2 Interstellar PG-13 169 73 87
## 3 Dumb and Dumber To PG-13 109 26 59
## 4 Ouija PG-13 83 7 30
## 5 Fury R 135 78 88
## 6 St. Vincent PG-13 103 74 84
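A slightly stricter variant uses vapply, which fails loudly if a field is missing or has an unexpected type instead of quietly returning a list. A minimal sketch, assuming a hand-built movies list that mimics the structure of in_theaters$movies (values copied from the first two rows above):

```r
# Hand-built stand-in with the same nesting as in_theaters$movies
movies <- list(
  list(title = "Big Hero 6", mpaa_rating = "PG", runtime = 93,
       ratings = list(critics_score = 89, audience_score = 94)),
  list(title = "Interstellar", mpaa_rating = "PG-13", runtime = 169,
       ratings = list(critics_score = 73, audience_score = 87))
)

# vapply checks that each extracted value has the declared type and length
d3 <- data.frame(
  title          = vapply(movies, function(x) x$title, character(1)),
  mpaa_rating    = vapply(movies, function(x) x$mpaa_rating, character(1)),
  runtime        = vapply(movies, function(x) x$runtime, numeric(1)),
  critics_score  = vapply(movies, function(x) x$ratings$critics_score, numeric(1)),
  audience_score = vapply(movies, function(x) x$ratings$audience_score, numeric(1)),
  stringsAsFactors = FALSE
)
str(d3)
```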
Someone on IMDb has created a list of “all” movies: http://www.imdb.com/list/ls057823854/
If you are logged in, you can export this list in CSV format, save it as all.csv, read it into a data frame called all, and create an id column:
all$id = str_replace(all$const,"^tt","")
These ids can be used to look up each movie through the Rotten Tomatoes API, so you could download a JSON file for every movie (though you might want to do this for just a sample): http://developer.rottentomatoes.com/docs/json/v10/Movie_Info
Sample code for downloading JSON files for all movies:
key = "[your rotten tomatoes API key]"
for(i in 1:nrow(all)){
url1 = "http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id="
id = all$id[i]
url2 = paste0("&type=imdb&apikey=", key)
url = paste0(url1, id, url2)
filename = paste0(i,"_",id,".json")
download.file(url = url, destfile = filename)
Sys.sleep(1) # pause one second between requests
}
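With many requests, a single failed download would stop the whole loop. One way to guard against that is sketched below; build_url and safe_download are hypothetical helper names, not part of any package, and the id and key in the example call are placeholders:

```r
# Build the request URL for one IMDb id (same pieces as in the loop above)
build_url <- function(id, key) {
  paste0("http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id=",
         id, "&type=imdb&apikey=", key)
}

# Try one download; return TRUE on success and FALSE if download.file errors
safe_download <- function(url, destfile) {
  tryCatch({
    download.file(url = url, destfile = destfile, quiet = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

# Example: the URL for a placeholder id (nothing is downloaded here)
build_url("0000001", "[your rotten tomatoes API key]")
```

Inside the loop, calling safe_download(url, filename) in place of download.file lets the remaining ids proceed, and the returned TRUE/FALSE values can be collected to see which ids need a retry.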
Then, using loops, you can read the information from each file and store it in a data set.
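A minimal sketch of that last step, assuming each downloaded file has the same title, runtime and ratings fields seen in in_theaters.JSON (an assumption, since the per-movie format is not shown here). Two tiny hypothetical files are written to a temporary folder so the example runs without an API key:

```r
library(rjson)

# Hypothetical stand-ins for two downloaded files, named like i_id.json
json_dir <- file.path(tempdir(), "rt_json")
dir.create(json_dir, showWarnings = FALSE)
writeLines('{"title": "Big Hero 6", "runtime": 93, "ratings": {"critics_score": 89}}',
           file.path(json_dir, "1_0000001.json"))
writeLines('{"title": "Interstellar", "runtime": 169, "ratings": {"critics_score": 73}}',
           file.path(json_dir, "2_0000002.json"))

# Read every JSON file in the folder into a list of parsed movies
files <- list.files(json_dir, pattern = "\\.json$", full.names = TRUE)
movies <- lapply(files, function(f) fromJSON(file = f))

# Pull the fields of interest into one data frame, one row per file
d_all <- data.frame(
  file          = basename(files),
  title         = sapply(movies, function(x) x$title),
  runtime       = sapply(movies, function(x) x$runtime),
  critics_score = sapply(movies, function(x) x$ratings$critics_score),
  stringsAsFactors = FALSE
)
d_all
```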