# Web scraping
### Dr. Maria Tackett
### 04.08.19

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

- Extra credit due Mon, Apr 15 at 11:59p
    - [http://bit.ly/sta199-sp19-posttest](http://bit.ly/sta199-sp19-posttest)
    - Access Code: SSD2484ZHY
    - Course: STA 199, Section: 1
    - Up to 5 points on Exam 02. Must get at least 70% to receive any points

---

# Scraping the web

---

## Scraping the web: what? why?

- An increasing amount of data is available on the web.

- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors.

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

- Two different scenarios:
    - <font class="vocab">Screen scraping</font>: extract data from the source code of a website, with an HTML parser (easy) or
    regular expression matching (less easy).
    - <font class="vocab">Web APIs (application programming interfaces)</font>: the website offers a set of structured HTTP
    requests that return JSON or XML files.
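
As a rough sketch of the difference between the two scenarios, the same record can arrive as HTML that has to be parsed or as JSON that is already structured. The HTML and JSON strings below are made up for illustration, and `jsonlite` is an assumed choice of JSON parser that is not otherwise used in these slides.

```r
library(rvest)     # HTML parsing
library(jsonlite)  # JSON parsing (assumed choice, not from the slides)

# Screen scraping: the value is buried in markup and must be selected out
html_src <- '<div class="movie"><span class="title">Heat</span> <span class="year">(1995)</span></div>'
scraped_title <- read_html(html_src) %>%
  html_nodes(".title") %>%
  html_text()

# Web API: the response is already organized as named fields
json_src <- '{"title": "Heat", "year": 1995}'
api_result <- fromJSON(json_src)

scraped_title     # "Heat"
api_result$title  # "Heat"
api_result$year   # 1995
```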
    
---

# Web Scraping with rvest

---

### Hypertext Markup Language (HTML)

Most of the data on the web is still available as HTML. While HTML is structured (hierarchical / tree based), it is often not in a form that is useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```
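
To see how this tree maps onto R, here is a minimal sketch that parses the same document from a character string and pulls out a few pieces. It relies on the `rvest` package and on `read_html()` accepting literal HTML; the variable names are only illustrative.

```r
library(rvest)

doc <- read_html('<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>')

# Each tag is a node in the tree; select nodes by tag name
page_title <- doc %>% html_nodes("title") %>% html_text()
paragraph  <- doc %>% html_nodes("p") %>% html_text()
alignment  <- doc %>% html_nodes("p") %>% html_attr("align")

page_title  # "This is a title"
paragraph   # "Hello world!"
alignment   # "center"
```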

---

## rvest

.pull-left[
- **`rvest`** is a package that makes basic processing and manipulation of HTML data straightforward

- It's designed to work with pipelines built with `%>%`
]

---

## Core functions in rvest:

- **`read_html`** - read HTML data from a url or character string.

- **`html_nodes`** - select specified nodes from the HTML document using CSS selectors.

- **`html_table`** - parse an HTML table into a data frame.

- **`html_text`** - extract tag pairs' content.

- **`html_name`** - extract tags' names.

- **`html_attrs`** - extract all of each tag's attributes.

- **`html_attr`** - extract tags' attribute value by name.
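
The following sketch tries each of these functions on a small, made-up HTML string, so it runs without a network connection. The table contents, link targets, and object names are assumptions chosen for illustration.

```r
library(rvest)

# A made-up page with one table and two links
doc <- read_html('<table id="films">
  <tr><th>title</th><th>year</th></tr>
  <tr><td>Heat</td><td>1995</td></tr>
  <tr><td>Casino</td><td>1995</td></tr>
</table>
<a href="/a" class="link">First</a>
<a href="/b" class="link">Second</a>')

links <- doc %>% html_nodes("a")

link_names <- html_name(links)          # tag names
link_text  <- html_text(links)          # text between the opening and closing tags
link_href  <- html_attr(links, "href")  # one attribute, selected by name
link_all   <- html_attrs(links)         # all attributes, one named vector per node

# html_table() on a set of nodes returns a list with one data frame per table
films <- doc %>% html_nodes("table") %>% html_table()
films[[1]]
```

Note that `html_table()` is applied to the selected `table` nodes, so the first (and here only) table is extracted with `[[1]]`.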

---

## SelectorGadget

- <font class="vocab">SelectorGadget</font>: Open source tool that eases CSS selector generation and discovery

- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

- Learn more on the [Selector Gadget Vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)

---

## Using SelectorGadget

- Click on the app logo next to the search bar
- A box will open in the bottom right of the website. Click on a page element that you 
would like your selector to match (it will turn green). SelectorGadget will then generate 
a minimal CSS selector for that element, and will highlight (yellow) everything that is 
matched by the selector. 
- Now click on a highlighted element to remove it from the selector (red), or click on an 
unhighlighted element to add it to the selector. Through this process of selection and 
rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.
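
The selectors that SelectorGadget produces are ordinary CSS selectors, so they can be tried out on a small snippet before being used on a real page. The HTML below is invented for illustration; the point is only the selector syntax (`.class`, `#id`, and a space meaning "nested inside").

```r
library(rvest)

snippet <- read_html('<div id="main">
  <p class="titleColumn"><a href="/t1">First film</a> <span class="secondaryInfo">(1994)</span></p>
  <p class="titleColumn"><a href="/t2">Second film</a> <span class="secondaryInfo">(1972)</span></p>
  <p class="other"><a href="/about">About</a></p>
</div>')

# ".titleColumn a": <a> tags nested inside elements with class "titleColumn"
film_links <- snippet %>% html_nodes(".titleColumn a") %>% html_text()

# ".secondaryInfo": every element with class "secondaryInfo"
film_years <- snippet %>% html_nodes(".secondaryInfo") %>% html_text()

# "#main a": every <a> tag inside the element whose id is "main"
all_links <- snippet %>% html_nodes("#main a") %>% html_text()

film_links  # "First film" "Second film"
film_years  # "(1994)" "(1972)"
all_links   # "First film" "Second film" "About"
```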

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
[http://www.imdb.com/chart/top](http://www.imdb.com/chart/top)

![imdb_top](img/12a/imdb_top_250.png)

---

### First check to make sure you're allowed!

```r
# install.packages("robotstxt")
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## 
 www.imdb.com                      No encoding supplied: defaulting to UTF-8.
```

```
## [1] TRUE
```

versus

```r
paths_allowed("http://www.facebook.com")
```

```
## 
 www.facebook.com
```

```
## [1] FALSE
```

---

## Demo

![imdb_top](img/12a/demo.png)

Go to [rstudio.cloud](https://rstudio.cloud/) → Web scraping → Make a copy → `scrape-250.R`

---

## Select and format pieces

```r
library(rvest)
library(tidyverse) # for str_replace() and tibble()

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()
  
imdb_top_250 <- tibble(
  title = titles, 
  year = years, 
  score = scores
  )
```

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> title </th>
   <th style="text-align:left;"> year </th>
   <th style="text-align:left;"> score </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> The Shawshank Redemption </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 9.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather </td>
   <td style="text-align:left;"> 1972 </td>
   <td style="text-align:left;"> 9.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather: Part II </td>
   <td style="text-align:left;"> 1974 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Dark Knight </td>
   <td style="text-align:left;"> 2008 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 12 Angry Men </td>
   <td style="text-align:left;"> 1957 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Schindler's List </td>
   <td style="text-align:left;"> 1993 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td>
   <td style="text-align:left;"> 2003 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Pulp Fiction </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Good, the Bad and the Ugly </td>
   <td style="text-align:left;"> 1966 </td>
   <td style="text-align:left;"> 8.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fight Club </td>
   <td style="text-align:left;"> 1999 </td>
   <td style="text-align:left;"> 8.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
  </tr>
</tbody>
</table>

---

## Clean up / enhance

This may or may not be a lot of work, depending on how messy the data are.

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfathe…
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999…
## $ score <dbl> 9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.7…
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )
```

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> title </th>
   <th style="text-align:left;"> year </th>
   <th style="text-align:left;"> score </th>
   <th style="text-align:left;"> rank </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> The Shawshank Redemption </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 9.2 </td>
   <td style="text-align:left;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather </td>
   <td style="text-align:left;"> 1972 </td>
   <td style="text-align:left;"> 9.2 </td>
   <td style="text-align:left;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Godfather: Part II </td>
   <td style="text-align:left;"> 1974 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Dark Knight </td>
   <td style="text-align:left;"> 2008 </td>
   <td style="text-align:left;"> 9 </td>
   <td style="text-align:left;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 12 Angry Men </td>
   <td style="text-align:left;"> 1957 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Schindler's List </td>
   <td style="text-align:left;"> 1993 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td>
   <td style="text-align:left;"> 2003 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Pulp Fiction </td>
   <td style="text-align:left;"> 1994 </td>
   <td style="text-align:left;"> 8.9 </td>
   <td style="text-align:left;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> The Good, the Bad and the Ugly </td>
   <td style="text-align:left;"> 1966 </td>
   <td style="text-align:left;"> 8.8 </td>
   <td style="text-align:left;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Fight Club </td>
   <td style="text-align:left;"> 1999 </td>
   <td style="text-align:left;"> 8.8 </td>
   <td style="text-align:left;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
   <td style="text-align:left;"> ... </td>
  </tr>
</tbody>
</table>

---

## Analyze

---

```r
imdb_top_250 %>% 
  filter(year == 1995)
```

```
## # A tibble: 8 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en               1995   8.6    20
## 2 The Usual Suspects  1995   8.5    28
## 3 Braveheart          1995   8.3    74
## 4 Toy Story           1995   8.3    88
## 5 Heat                1995   8.2   122
## 6 Casino              1995   8.2   143
## 7 Before Sunrise      1995   8.1   202
## 8 La Haine            1995   8     231
```

---

## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2014     7
## 4  1994     6
## 5  1997     6
```
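
As a possible shortcut, `dplyr::count()` combines the grouping, counting, and sorting steps. The sketch below is self-contained, so it uses only the ten rows displayed in the table earlier in these slides; its counts therefore differ from the full-list output above. The object names `top_10` and `year_counts` are illustrative.

```r
library(dplyr)

# First ten rows of the list, as displayed earlier in these slides
top_10 <- tibble(
  title = c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II",
            "The Dark Knight", "12 Angry Men", "Schindler's List",
            "The Lord of the Rings: The Return of the King", "Pulp Fiction",
            "The Good, the Bad and the Ugly", "Fight Club"),
  year  = c(1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999),
  score = c(9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8)
)

# count() groups by year and counts rows; sort = TRUE puts the largest counts first
year_counts <- top_10 %>%
  count(year, sort = TRUE, name = "total")

head(year_counts, 5)
```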

---

## Visualize

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]
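
One possible approach, sketched here under the assumption that a point-and-line plot is what is wanted: compute the mean score per year, then plot it against year with `ggplot2`. To stay self-contained, the sketch uses only the years and scores of the ten movies shown in the table earlier; with the full data you would start from `imdb_top_250` instead.

```r
library(dplyr)
library(ggplot2)

# Years and scores of the ten movies displayed earlier in these slides
top_10 <- tibble(
  year  = c(1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999),
  score = c(9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8)
)

# Average score per year
yearly_avg <- top_10 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score))

# One point per year, connected by a line to show change over time
avg_plot <- ggplot(yearly_avg, aes(x = year, y = avg_score)) +
  geom_point() +
  geom_line() +
  labs(x = "Year", y = "Average score",
       title = "Average score of top-ranked movies by year")

avg_plot
```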

---

## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...
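
Both challenges can be sketched on made-up data. Below, two invented HTML strings stand in for two result pages, and one listing has no price. `scrape_page()` is a hypothetical helper, and the class names `.listing`, `.name`, and `.price` are assumptions, not the markup of any real site.

```r
library(rvest)
library(dplyr)

# Two made-up "pages" of listings; the second listing has no price
pages <- c(
  '<ul><li class="listing"><span class="name">Apt A</span><span class="price">$900</span></li>
       <li class="listing"><span class="name">Apt B</span></li></ul>',
  '<ul><li class="listing"><span class="name">Apt C</span><span class="price">$1,250</span></li></ul>'
)

# Hypothetical helper: parse one page into a tibble with one row per listing
scrape_page <- function(src) {
  listings <- read_html(src) %>% html_nodes(".listing")
  tibble(
    name  = listings %>% html_node(".name") %>% html_text(),
    # html_node() returns one result per listing, so a missing price becomes NA
    price = listings %>% html_node(".price") %>% html_text()
  )
}

# Apply the helper to every page and stack the results
apartments <- bind_rows(lapply(pages, scrape_page))

# Strip "$" and "," before converting the price to a number
apartments$price <- as.numeric(gsub("[$,]", "", apartments$price))
apartments
```

Using `html_node()` (singular) inside each listing keeps the columns aligned; calling `html_nodes(".price")` on the whole page would return only two prices for three listings. With real URLs in place of the strings, it is also considerate to pause between requests, for example with `Sys.sleep()`.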

.question[
Compare the display of information at [raleigh.craigslist.org/search/apa](https://raleigh.craigslist.org/search/apa) to the IMDB top 250 list.

What challenges can you foresee in scraping a list of the available apartments?
]

---

## Application Exercise

---

## Popular TV Shows

RStudio Cloud → Web scraping

1. Scrape the list of most popular TV shows on IMDB: http://www.imdb.com/chart/tvmeter

2. Examine the subpage of each of the first three TV shows (or however many you can get through) to also obtain genre and runtime.

3. Time permitting, also try to get the following:

    - How many episodes so far
    - Certificate
    - First five plot keywords
    - Country
    - Language

Add this information to the data frame you created in step 1.
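
As a starting point for step 2, here is a hedged sketch of how links to subpages might be collected. The HTML string is a made-up stand-in for the chart page, the `href` values are hypothetical, and it is only an assumption that the TV chart uses the same `.titleColumn a` structure as the top 250 page; check the real page with SelectorGadget.

```r
library(rvest)
library(xml2)

# Made-up stand-in for the chart page; the hrefs are hypothetical
chart <- read_html('<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/">Show one</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Show two</a></td></tr>
</table>')

show_links <- chart %>% html_nodes(".titleColumn a")

shows <- data.frame(
  title = html_text(show_links),
  # Turn the relative hrefs into full URLs that read_html() could fetch
  url   = url_absolute(html_attr(show_links, "href"),
                       "http://www.imdb.com/chart/tvmeter"),
  stringsAsFactors = FALSE
)
shows
```

Each value in `shows$url` could then be passed to `read_html()` to scrape that show's own page.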