Agenda

  • Screen scraping

  • Web APIs

  • Due Tuesday: HW 4

Scraping the web

Scraping the web: what? why?

  • An increasing amount of data is available on the web.
  • These data are often provided in an unstructured format: you can always copy and paste, but that is time-consuming and error-prone.
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
  • Two different scenarios:
    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
  • Why R? It includes all the tools necessary for web scraping, we are already familiar with it, and we can analyze the scraped data directly… But Python, Perl, and Java are also effective tools.

Screen scraping

Top 250 movies on IMDB

Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
[Screenshot: the IMDB Top 250 table]
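Since the rankings live in an HTML table, a quick first pass is to parse that table directly. A minimal sketch using rvest's html_table(); it assumes the rankings are in the first table on the page:

library(rvest)

# Parse the first <table> on the page into a data frame
page <- read_html("http://www.imdb.com/chart/top")
ranking <- page %>%
  html_node("table") %>%
  html_table()

The selector-based approach on the following slides gives finer control over which pieces end up in which columns.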

Useful R libraries

  • rvest: Easily harvest (scrape) web pages

  • stringr: Make it easier to work with strings

  • dplyr: A Grammar of Data Manipulation

library(rvest)
library(stringr)
library(dplyr)

rvest

SelectorGadget

  • SelectorGadget: an open source tool that makes CSS selector generation and discovery on complicated sites a breeze

  • Install the Chrome Extension

  • A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
  • Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the perfect CSS selector for your needs.

vignette("selectorgadget")
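Once SelectorGadget hands you a selector, it plugs straight into rvest. A minimal sketch (the ".titleColumn a" selector is the one used for the IMDB example on the next slides; html_attr() pulls an attribute instead of the text):

library(rvest)

page <- read_html("http://www.imdb.com/chart/top")

# Extract the link target (href) of every element matched by the selector
page %>%
  html_nodes(".titleColumn a") %>%
  html_attr("href")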

Select and format pieces

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") # remove )
  
scores <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  tail(-1) # remove first entry that is not a score

imdb_top_250 <- data.frame(title = titles, year = years, score = scores)
head(imdb_top_250)
##                      title year score
## 1 The Shawshank Redemption 1994   9.2
## 2            The Godfather 1972   9.2
## 3   The Godfather: Part II 1974   9.0
## 4          The Dark Knight 2008   8.9
## 5             12 Angry Men 1957   8.9
## 6         Schindler's List 1993   8.9

Clean up

May or may not be a lot of work depending on how messy the data are

  • Make score numeric
imdb_top_250$score <- as.numeric(as.character(imdb_top_250$score))
  • Add a column for rank
imdb_top_250$rank <- as.numeric(row.names(imdb_top_250))
head(imdb_top_250)
##                      title year score rank
## 1 The Shawshank Redemption 1994   9.2    1
## 2            The Godfather 1972   9.2    2
## 3   The Godfather: Part II 1974   9.0    3
## 4          The Dark Knight 2008   8.9    4
## 5             12 Angry Men 1957   8.9    5
## 6         Schindler's List 1993   8.9    6
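Equivalently, since dplyr is already loaded, both steps can be written as one mutate() pipeline (a sketch; row_number() reproduces the row-name-based rank):

imdb_top_250 <- imdb_top_250 %>%
  mutate(score = as.numeric(as.character(score)), # factor -> numeric
         rank = row_number())                     # position in the list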

Analyze

See which years have the most movies on the list:

imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(5)
## Source: local data frame [5 x 2]
## 
##     year total
##   (fctr) (int)
## 1   1995     9
## 2   1957     7
## 3   2001     7
## 4   2003     7
## 5   2014     7

Analyze

See the 1995 movies:

imdb_top_250 %>% 
  filter(year == 1995)
##                title year score rank
## 1              Se7en 1995   8.6   22
## 2 The Usual Suspects 1995   8.6   24
## 3         Braveheart 1995   8.3   77
## 4          Toy Story 1995   8.3   96
## 5               Heat 1995   8.2  123
## 6             Casino 1995   8.2  139
## 7     Before Sunrise 1995   8.0  205
## 8     Twelve Monkeys 1995   8.0  206
## 9           La Haine 1995   8.0  223

Visualize

Plot yearly average scores:

library(ggplot2) # needed for ggplot(); not loaded earlier

imdb_top_250 %>% 
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = as.numeric(as.character(year)))) +
    geom_point() +
    geom_smooth(method = "lm") +
    xlab("year")

Potential challenges

  • Unreliable formatting at the source
  • Data broken into many pages (see the paging sketch below)

Discussion: https://raleigh.craigslist.org/search/apa
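When results span several pages, the usual fix is to loop over the page URLs, pause between hits, and stack the pieces. A hedged sketch: the ?s= offset parameter and the a.result-title selector mimic craigslist-style paging and are assumptions; check the actual URL structure and markup before relying on them.

library(rvest)
library(dplyr)

base_url <- "https://raleigh.craigslist.org/search/apa"
offsets <- seq(0, 240, by = 120) # hypothetical: 120 results per page

listings <- lapply(offsets, function(s) {
  page <- read_html(paste0(base_url, "?s=", s))
  Sys.sleep(1) # be polite: wait between hits
  data.frame(title = page %>% html_nodes("a.result-title") %>% html_text())
}) %>%
  bind_rows()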

Web APIs

The rules of the game

  • Respect the hosting site’s wishes:
    • Check if an API exists first, or if the data are available for download.
    • Some websites “disallow” scrapers in their robots.txt files (see the sketch below for checking this from R).
  • Limit your bandwidth use:
    • Wait one or more seconds after each hit.
    • Try to scrape websites during off-peak hours.
  • Scrape only what you need, and just once.
  • When using APIs, read the terms and conditions:
    • The fact that you can access some data doesn’t mean you should use it for your research.
    • Be aware of rate limits.
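A minimal sketch of putting these rules into practice (the robotstxt package is an assumption, not used elsewhere in these slides; its paths_allowed() helper checks a site's robots.txt from R):

library(robotstxt)

# Check whether this path may be scraped before hitting the site
paths_allowed("https://raleigh.craigslist.org/search/apa")

# ... and wait between consecutive requests
Sys.sleep(1)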

Technical details

  • Often requires creating an account to get an API key.

  • R packages like jsonlite will be useful for parsing the JSON data you retrieve (see the sketch below).

  • We won’t focus on this in this class, but I’m happy to point you to resources.
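As a small taste, a hedged sketch using jsonlite against GitHub's public API (no API key needed for light use; the endpoint and field name are assumptions based on GitHub's documented API):

library(jsonlite)

# Request repository metadata as JSON and parse it into an R list
repo <- fromJSON("https://api.github.com/repos/hadley/rvest")
repo$stargazers_count # e.g., the repository's star count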

Go (get some data) Blue Devils!