Packages

library(tidyverse)
library(rvest)
library(jsonlite)
library(RSelenium)

Exercise 1

Problem

http://quotes.toscrape.com/ provides quotes for scraping. This site scraping sandox provides various endpoints that present different scraping challenges. Try and scrape the first 50 quotes and authors at http://quotes.toscrape.com/scroll.

First, try using the typical approach with rvest to understand what is going on.

Solution

get_quotes_scroll <- function(page) {
  base_url <- "http://quotes.toscrape.com/api/quotes?page="
  url <- str_c(base_url, page)
  
  x <- read_json(url)
  x$quotes
}

quotes <- map(1:5, get_quotes_scroll)

Exercise 2

Problem

http://quotes.toscrape.com/ provides quotes for scraping. This site scraping sandox provides various endpoints that present different scraping challenges. Try and scrape the first 50 quotes and authors at http://quotes.toscrape.com/js/.

First, try using the typical approach with rvest to understand what is going on.

Solution

First set-up an R Selenium driver.

driver <- rsDriver(browser=c("chrome"), 
                   chromever = "85.0.4183.87", port = 5112L)
remote_driver <- driver$client

Navigate to page 1 to test that our driver works.

quotes_js_url <- "http://quotes.toscrape.com/js/"
remote_driver$navigate(quotes_js_url)

Scrape the quotes and authors on page. Function getPageSource() will return a list with the HTML document as the first component.

quotes_html <- read_html(remote_driver$getPageSource()[[1]])

quotes <- quotes_html %>% 
  html_nodes(".text") %>% 
  html_text()

authors <- quotes_html %>% 
  html_nodes(".author") %>% 
  html_text

Turn the above code into a function for iterative page scraping.

get_quotes <- function(page) {
  base_url <- quotes_js_url
  url <- str_c(base_url, "page/", page, "/")
  remote_driver$navigate(url)
  
  quotes_html <- read_html(remote_driver$getPageSource()[[1]])

  quotes <- quotes_html %>% 
    html_nodes(".text") %>% 
    html_text()
  
  authors <- quotes_html %>% 
    html_nodes(".author") %>% 
    html_text
  
  tibble(
    quote  = quotes,
    author = authors
  )
}

Scrape pages 1 - 5 and create a tibble.

map_df(1:5, get_quotes)

Close the driver down.

netstat::free_port()
remote_driver$close()
rm(driver)
rm(remote_driver)