# Web scraping <br> 🕸

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---

## Announcements

- Go to Sakai -> Tests & Quizzes -> MT 1 Reflection, complete by Thursday
- No IDC on Monday before Thanksgiving

---

# Scraping the web

---

## Scraping the web: what? why?

- An increasing amount of data is available on the web
--

- These data are provided in an unstructured format: you can always copy and paste, 
but it's time-consuming and prone to errors

--
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

--
- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML 
    parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of 
    structured HTTP requests that return JSON or XML files.
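
A minimal sketch of both scenarios, using made-up strings in place of a real page and a real API response (the markup, the `movie` class, and the JSON fields are assumptions for illustration only):

```r
library(rvest)
library(jsonlite)

# Scenario 1: screen scraping a (made-up) snippet of page source
page_source <- '<ul><li class="movie">Heat</li><li class="movie">Casino</li></ul>'

# HTML parser: select nodes with a CSS selector, then extract their text
parsed <- read_html(page_source) %>%
  html_nodes("li.movie") %>%
  html_text()

# Regular expression: match the same content directly in the raw string
pattern <- '(?<=<li class="movie">)[^<]+'
matched <- regmatches(page_source, gregexpr(pattern, page_source, perl = TRUE))[[1]]

# Scenario 2: a web API would return structured data, such as this JSON
api_response <- '[{"title": "Heat", "year": 1995}, {"title": "Casino", "year": 1995}]'
movies <- fromJSON(api_response) # parsed straight into a data frame
```

The regular expression depends on the exact markup, which is why the parser route is usually the easier one.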

---

# Web Scraping with rvest

---

## Hypertext Markup Language

- Most of the data on the web is still available as HTML 
- It is structured (hierarchical / tree based), but it's often not available in 
a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```
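
As a rough sketch of going from tree to flat, the same document can be parsed from a string and its pieces collected into a one-row data frame (the column names here are made up for illustration):

```r
library(rvest)

doc <- read_html('<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>')

# Walk the tree with CSS selectors
page_title <- doc %>% html_node("title") %>% html_text()
paragraph  <- doc %>% html_node("p")

# Flatten the pieces we care about into a tidy, one-row data frame
flat <- data.frame(
  title = page_title,
  text  = html_text(paragraph),
  align = html_attr(paragraph, "align"),
  stringsAsFactors = FALSE
)
```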

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straightforward
- It's designed to work with pipelines built with `%>%`
]
.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html`   - Read HTML data from a url or character string
- `html_node`   - Select a specified node from HTML document
- `html_nodes`  - Select specified nodes from HTML document
- `html_table`  - Parse an HTML table into a data frame
- `html_text`   - Extract tag pairs' content
- `html_name`   - Extract tags' names
- `html_attrs`  - Extract all of each tag's attributes
- `html_attr`   - Extract tags' attribute value by name
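
A small sketch that exercises each of these functions on a made-up HTML string (the links, ids, and table contents are assumptions for illustration):

```r
library(rvest)

doc <- read_html('
  <body>
    <a id="first" href="https://example.com/a" class="link">First link</a>
    <a id="second" href="https://example.com/b" class="link">Second link</a>
    <table>
      <tr><th>title</th><th>year</th></tr>
      <tr><td>Heat</td><td>1995</td></tr>
    </table>
  </body>')

first_link <- doc %>% html_node("a")   # first match only
all_links  <- doc %>% html_nodes("a")  # every match

tag_name   <- html_name(first_link)          # name of the tag
link_text  <- html_text(all_links)           # content between the tag pairs
link_urls  <- html_attr(all_links, "href")   # one attribute, by name
link_attrs <- html_attrs(first_link)         # all attributes of the tag

# Parse the table node into a data frame
movies <- doc %>% html_node("table") %>% html_table()
```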

---

## SelectorGadget

.pull-left[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) 
- Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
]
.pull-right[
<img src="img/selector-gadget.png" width="456" />
]

---

## Using the SelectorGadget

.pull-left[
- Click on the app logo next to the search bar
- A box will open in the bottom right of the website
]
.pull-right[
<img src="img/selector-gadget.gif" height="250" style="display: block; margin: auto;" />
]

- Click on a page element (it will turn green); SelectorGadget will generate a 
minimal CSS selector for that element and highlight (in yellow) everything 
that is matched by the selector

--
- Click on a highlighted element to remove it from the selector (red), or 
click on an unhighlighted element to add it to the selector

--
- Through this process of selection and rejection, SelectorGadget helps you come 
up with the appropriate CSS selector for your needs
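
Once you have a selector, it goes straight into `html_nodes()`. A sketch on made-up markup (the class names and links are assumptions for illustration) showing why narrowing the selector matters:

```r
library(rvest)

# Made-up markup standing in for a real page
snippet <- read_html('
  <table>
    <tr><td class="titleColumn"><a href="/title/1">Movie A</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/2">Movie B</a></td></tr>
    <tr><td class="other"><a href="/help">Help</a></td></tr>
  </table>')

# Too broad: matches every link, including the unwanted "Help" link
all_links <- snippet %>% html_nodes("a") %>% html_text()

# Narrowed: only links inside elements with class "titleColumn"
titles <- snippet %>% html_nodes(".titleColumn a") %>% html_text()
```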

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top

![imdb_top](img/imdb_top_250.png)

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## 
 www.imdb.com                      No encoding supplied: defaulting to UTF-8.
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## 
 www.facebook.com
```

```
## [1] FALSE
```
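
Permissions can differ by path within a site. A sketch of checking individual paths, assuming a made-up `robots.txt` supplied as text so that no request is sent (the rules, the domain, and the paths are hypothetical):

```r
library(robotstxt)

# A made-up robots.txt, passed in as text instead of being downloaded
rules <- "User-agent: *\nDisallow: /private/\n"

rt <- robotstxt(domain = "example.com", text = rules)

# Check individual paths rather than the site as a whole
allowed <- rt$check(paths = c("/", "/private/data.html"), bot = "*")
```

For a live site, the `paths` argument of `paths_allowed()` can be used in the same spirit.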

---

## Demo

.center[
Go to [rstudio.cloud](https://rstudio.cloud/spaces/3518/projects)  
Make a copy of the project titled *Demo - Web scraping*  
Open `scrape-250.R`
]

---

## Select and format pieces

```r
library(tidyverse) # str_replace(), tibble(), %>%
library(rvest)     # read_html(), html_nodes(), html_text()

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()
  
imdb_top_250 <- tibble(
  title = titles, 
  year = years, 
  score = scores
  )
```
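
The year cleanup can be tried offline. A sketch on made-up strings, assuming the scraped years arrive wrapped in parentheses as the `# remove (` and `# remove )` comments suggest:

```r
library(stringr)

raw_years <- c("(1994)", "(1972)", "(2008)")

# Parentheses are regex metacharacters, so they must be escaped ("\\(")
# or placed inside a character class ("[()]")
years <- raw_years %>%
  str_remove_all("[()]") %>%
  as.numeric()
```

A character class removes both parentheses in one step, instead of one `str_replace()` call per character.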

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> title </th>
   <th style="text-align:left;"> year </th>
   <th style="text-align:left;"> score </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> The Shawshank Redemption </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 9.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather </td>
   <td style="text-align:left;"> 1972 </td>
   <td style="text-align:left;"> 9.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather: Part II </td>
   <td style="text-align:left;"> 1974 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Dark Knight </td>
   <td style="text-align:left;"> 2008 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 12 Angry Men </td>
   <td style="text-align:left;"> 1957 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Schindler's List </td>
   <td style="text-align:left;"> 1993 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td>
   <td style="text-align:left;"> 2003 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Pulp Fiction </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Good, the Bad and the Ugly </td>
   <td style="text-align:left;"> 1966 </td>
   <td style="text-align:left;"> 8.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fight Club </td>
   <td style="text-align:left;"> 1999 </td>
   <td style="text-align:left;"> 8.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td>
   <td style="text-align:left;"> 2001 </td>
   <td style="text-align:left;"> 8.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Forrest Gump </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td>
   <td style="text-align:left;"> 1980 </td>
   <td style="text-align:left;"> 8.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Inception </td>
   <td style="text-align:left;"> 2010 </td>
   <td style="text-align:left;"> 8.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td>
   <td style="text-align:left;"> 2002 </td>
   <td style="text-align:left;"> 8.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
  </tr>
</tbody>
</table>

---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfat...
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 19...
## $ score <dbl> 9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8...
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )
```
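
One possible alternative, sketched on a made-up tibble: `row_number()` assigns the same ranks without referring back to the data frame by name, and it also copes with a data frame that has zero rows.

```r
library(dplyr)

# Toy stand-in for imdb_top_250 (titles and scores are made up)
toy <- tibble(title = c("A", "B", "C"), score = c(9.2, 9.0, 8.9))

ranked <- toy %>%
  mutate(rank = row_number()) # 1, 2, 3, ... in current row order
```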

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> title </th>
   <th style="text-align:left;"> year </th>
   <th style="text-align:left;"> score </th>
   <th style="text-align:left;"> rank </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> The Shawshank Redemption </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 9.2 </td>
   <td style="text-align:left;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather </td>
   <td style="text-align:left;"> 1972 </td>
   <td style="text-align:left;"> 9.2 </td>
   <td style="text-align:left;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather: Part II </td>
   <td style="text-align:left;"> 1974 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Dark Knight </td>
   <td style="text-align:left;"> 2008 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 12 Angry Men </td>
   <td style="text-align:left;"> 1957 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Schindler's List </td>
   <td style="text-align:left;"> 1993 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td>
   <td style="text-align:left;"> 2003 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Pulp Fiction </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Good, the Bad and the Ugly </td>
   <td style="text-align:left;"> 1966 </td>
   <td style="text-align:left;"> 8.8 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fight Club </td>
   <td style="text-align:left;"> 1999 </td>
   <td style="text-align:left;"> 8.8 </td>
   <td style="text-align:left;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td>
   <td style="text-align:left;"> 2001 </td>
   <td style="text-align:left;"> 8.8 </td>
   <td style="text-align:left;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Forrest Gump </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.7 </td>
   <td style="text-align:left;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td>
   <td style="text-align:left;"> 1980 </td>
   <td style="text-align:left;"> 8.7 </td>
   <td style="text-align:left;"> 13 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Inception </td>
   <td style="text-align:left;"> 2010 </td>
   <td style="text-align:left;"> 8.7 </td>
   <td style="text-align:left;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td>
   <td style="text-align:left;"> 2002 </td>
   <td style="text-align:left;"> 8.7 </td>
   <td style="text-align:left;"> 15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
  </tr>
</tbody>
</table>

---

## Analyze

```r
imdb_top_250 %>% 
  filter(year == 1995)
```

```
## # A tibble: 8 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en               1995   8.6    21
## 2 The Usual Suspects  1995   8.5    26
## 3 Braveheart          1995   8.3    73
## 4 Toy Story           1995   8.3    91
## 5 Heat                1995   8.2   122
## 6 Casino              1995   8.2   143
## 7 Before Sunrise      1995   8.1   204
## 8 La Haine            1995   8     229
```

---

## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2000     6
## 4  2001     6
## 5  2003     6
```
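
The same tally can probably be written more compactly with `count()`. A sketch on a made-up vector of years:

```r
library(dplyr)

# Toy stand-in for the year column (values are made up)
toy <- tibble(year = c(1995, 1995, 1957, 1995, 1957, 2000))

top_years <- toy %>%
  count(year, sort = TRUE, name = "total") %>% # group, tally, sort descending
  head(5)
```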

---

## Visualize

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]
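
One possible approach, sketched on made-up data rather than the scraped list: average the scores within each year, then plot the averages against year.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for imdb_top_250 (values are made up)
toy <- tibble(
  year  = c(1994, 1994, 1995, 1995, 1999),
  score = c(9.2, 8.8, 8.6, 8.4, 8.8)
)

# One row per year, holding the mean score of that year's movies
yearly_avg <- toy %>%
  group_by(year) %>%
  summarise(avg_score = mean(score))

p <- ggplot(yearly_avg, aes(x = year, y = avg_score)) +
  geom_point() +
  geom_line() +
  labs(x = "Year", y = "Average score")
```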

---

## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...
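
For data broken into many pages, a common pattern is to write a function that scrapes one page and then map it over all pages. A sketch with made-up page sources (the `item` class and the `scrape_page()` helper are hypothetical):

```r
library(rvest)
library(dplyr)
library(purrr)

# Made-up result pages standing in for page 1, page 2, ... of a listing
pages <- c(
  '<ul><li class="item">Apt A</li><li class="item">Apt B</li></ul>',
  '<ul><li class="item">Apt C</li></ul>'
)

# Hypothetical helper: scrape a single page into a tibble
scrape_page <- function(src) {
  items <- read_html(src) %>% html_nodes(".item") %>% html_text()
  tibble(name = items)
}

# Apply the helper to every page and stack the results,
# recording which page each row came from
listings <- map(pages, scrape_page) %>%
  bind_rows(.id = "page")
```

With a real site, `pages` would hold URLs instead, and a pause between requests (e.g. `Sys.sleep()`) would be polite.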

.question[
Compare the display of information at [raleigh.craigslist.org/search/apa](https://raleigh.craigslist.org/search/apa) to the IMDB top 250 list. What challenges can you foresee in scraping a list of the available apartments?
]

---

# Application exercise

---

## <i class="fas fa-laptop"></i> AE 07 - Web scraping

- Clone your assignment repo in RStudio Cloud (`ae-07-web-scraping-TEAMNAME`)
- Open the R script called `scrape-tvshows.R`
- Scrape the names, scores, and years of the most popular TV shows on IMDB:
[www.imdb.com/chart/tvmeter](http://www.imdb.com/chart/tvmeter)
- Create a data frame called `tvshows` with four variables 
(`rank`, `name`, `score`, `year`)  
- Examine each of the **first three** TV shows to also obtain 
  - Genre
  - Runtime
  - How many episodes so far
  - First five plot keywords
- Add this information to the `tvshows` data frame you created earlier