Agenda

  • Screen scraping

  • Web APIs

  • Due Tuesday: HW 4

Scraping the web

Scraping the web: what? why?

  • An increasing amount of data is available on the web.
  • These data are often provided in an unstructured format: you can always copy and paste, but that is time-consuming and error-prone.
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
  • Two different scenarios:
    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
  • Why R? It includes all the tools necessary for web scraping, we are already familiar with it, and we can analyze the scraped data directly… But Python, Perl, and Java are also effective tools.

Screen scraping

Top 250 movies on IMDB

Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
[Screenshot: the IMDB Top 250 table]
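Since the rankings live in an HTML table, a quick first pass is to parse that table directly. A minimal sketch using rvest's html_table(); it assumes the rankings are in the first table on the page:

library(rvest)

# Parse the first <table> on the page into a data frame
page <- read_html("http://www.imdb.com/chart/top")
ranking <- page %>%
  html_node("table") %>%
  html_table()

The selector-based approach on the following slides gives finer control over which pieces end up in which columns.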

Useful R libraries

  • rvest: Easily harvest (scrape) web pages

  • stringr: Make it easier to work with strings

  • dplyr: A Grammar of Data Manipulation

library(rvest)
library(stringr)
library(dplyr)

rvest

SelectorGadget

  • SelectorGadget: an open source tool that makes CSS selector generation and discovery on complicated sites a breeze

  • Install the Chrome Extension

  • A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
  • Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the perfect CSS selector for your needs.

vignette("selectorgadget")
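Once SelectorGadget hands you a selector, it plugs straight into rvest. A minimal sketch (the ".titleColumn a" selector is the one used for the IMDB example on the next slides; html_attr() pulls an attribute instead of the text):

library(rvest)

page <- read_html("http://www.imdb.com/chart/top")

# Extract the link target (href) of every element matched by the selector
page %>%
  html_nodes(".titleColumn a") %>%
  html_attr("href")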

Select and format pieces

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") # remove )
  
scores <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  tail(-1) # remove first entry that is not a score

imdb_top_250 <- data.frame(title = titles, year = years, score = scores)
head(imdb_top_250)
##                      title year score
## 1 The Shawshank Redemption 1994   9.2
## 2            The Godfather 1972   9.2
## 3   The Godfather: Part II 1974   9.0
## 4          The Dark Knight 2008   8.9
## 5             12 Angry Men 1957   8.9
## 6         Schindler's List 1993   8.9

Clean up

May or may not be a lot of work depending on how messy the data are

  • Make score numeric
imdb_top_250$score <- as.numeric(as.character(imdb_top_250$score))
  • Add a column for rank
imdb_top_250$rank <- as.numeric(row.names(imdb_top_250))
head(imdb_top_250)
##                      title year score rank
## 1 The Shawshank Redemption 1994   9.2    1
## 2            The Godfather 1972   9.2    2
## 3   The Godfather: Part II 1974   9.0    3
## 4          The Dark Knight 2008   8.9    4
## 5             12 Angry Men 1957   8.9    5
## 6         Schindler's List 1993   8.9    6
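Equivalently, since dplyr is already loaded, both steps can be written as one mutate() pipeline (a sketch; row_number() reproduces the row-name-based rank):

imdb_top_250 <- imdb_top_250 %>%
  mutate(score = as.numeric(as.character(score)), # factor -> numeric
         rank = row_number())                     # position in the list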

Analyze

See which years have the most movies on the list:

imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(5)
## Source: local data frame [5 x 2]
## 
##     year total
##   (fctr) (int)
## 1   1995     9
## 2   1957     7
## 3   2001     7
## 4   2003     7
## 5   2014     7

Analyze

See the 1995 movies:

imdb_top_250 %>% 
  filter(year == 1995)
##                title year score rank
## 1              Se7en 1995   8.6   22
## 2 The Usual Suspects 1995   8.6   24
## 3         Braveheart 1995   8.3   77
## 4          Toy Story 1995   8.3   96
## 5               Heat 1995   8.2  123
## 6             Casino 1995   8.2  139
## 7     Before Sunrise 1995   8.0  205
## 8     Twelve Monkeys 1995   8.0  206
## 9           La Haine 1995   8.0  223

Visualize

Plot yearly average scores:

library(ggplot2) # needed for ggplot(); not loaded earlier

imdb_top_250 %>% 
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = as.numeric(as.character(year)))) +
    geom_point() +
    geom_smooth(method = "lm") +
    xlab("year")

Potential challenges

  • Unreliable formatting at the source
  • Data broken into many pages (see the paging sketch below)

Discussion: https://raleigh.craigslist.org/search/apa
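When results span several pages, the usual fix is to loop over the page URLs, pause between hits, and stack the pieces. A hedged sketch: the ?s= offset parameter and the a.result-title selector mimic craigslist-style paging and are assumptions; check the actual URL structure and markup before relying on them.

library(rvest)
library(dplyr)

base_url <- "https://raleigh.craigslist.org/search/apa"
offsets <- seq(0, 240, by = 120) # hypothetical: 120 results per page

listings <- lapply(offsets, function(s) {
  page <- read_html(paste0(base_url, "?s=", s))
  Sys.sleep(1) # be polite: wait between hits
  data.frame(title = page %>% html_nodes("a.result-title") %>% html_text())
}) %>%
  bind_rows()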

Web APIs

The rules of the game

  • Respect the hosting site’s wishes:
    • Check if an API exists first, or if the data are available for download.
    • Some websites “disallow” scrapers in their robots.txt files (see the sketch below for checking this from R).
  • Limit your bandwidth use:
    • Wait one or more seconds after each hit.
    • Try to scrape websites during off-peak hours.
  • Scrape only what you need, and just once.
  • When using APIs, read the terms and conditions:
    • The fact that you can access some data doesn’t mean you should use it for your research.
    • Be aware of rate limits.
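A minimal sketch of putting these rules into practice (the robotstxt package is an assumption, not used elsewhere in these slides; its paths_allowed() helper checks a site's robots.txt from R):

library(robotstxt)

# Check whether this path may be scraped before hitting the site
paths_allowed("https://raleigh.craigslist.org/search/apa")

# ... and wait between consecutive requests
Sys.sleep(1)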

Technical details

  • Often requires creating an account to get an API key.

  • R packages like jsonlite will be useful for parsing the JSON data you retrieve (see the sketch below).

  • We won’t focus on this in this class, but I’m happy to point you to resources.
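As a small taste, a hedged sketch using jsonlite against GitHub's public API (no API key needed for light use; the endpoint and field name are assumptions based on GitHub's documented API):

library(jsonlite)

# Request repository metadata as JSON and parse it into an R list
repo <- fromJSON("https://api.github.com/repos/hadley/rvest")
repo$stargazers_count # e.g., the repository's star count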

Go (get some data) Blue Devils!