library(tidyverse)
library(rvest)
library(jsonlite)
library(RSelenium)
http://quotes.toscrape.com/ provides quotes for scraping. This scraping sandbox offers several endpoints, each presenting a different scraping challenge. Try to scrape the first 50 quotes and authors at http://quotes.toscrape.com/scroll.
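First, try using the typical approach with rvest to understand what is going on. The quotes are fetched from a JSON API as the page scrolls, so the static HTML that rvest reads contains none of them. For that reason, get_quotes_scroll() queries the API directly, one page at a time.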
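To see why the typical rvest approach comes back empty, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the static source of a scroll-style page (an empty container plus a script that fills it later), and load_quotes.js is an invented file name; neither is copied from the site.

```r
library(rvest)

# Hypothetical stand-in for the static HTML of a page that loads its
# quotes with JavaScript: the container is empty until a script runs.
static_html <- '
<html><body>
  <div class="quotes"></div>
  <script src="load_quotes.js"></script>
</body></html>'

page <- read_html(static_html)

# The same kind of selector that works on a rendered page matches nothing here.
quote_text <- page %>%
  html_nodes(".text") %>%
  html_text()

length(quote_text)
```

Running the same selector against read_html("http://quotes.toscrape.com/scroll") is the real version of this check; an empty result there is the cue to look for the request that actually delivers the data.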
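# Fetch one page of quotes from the JSON API behind the scroll page.
get_quotes_scroll <- function(page) {
  base_url <- "http://quotes.toscrape.com/api/quotes?page="
  url <- str_c(base_url, page)
  x <- read_json(url)
  # Keep only the list of quote records from the parsed response.
  x$quotes
}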
quotes <- map(1:5, get_quotes_scroll)
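The result of map() above is a list of pages, each holding a list of quote records, so it still has to be flattened into a table of quotes and authors. A minimal sketch follows, assuming each record carries a text field and an author field with a nested name; the pages object is a hypothetical stand-in with that assumed shape, not real API output.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the output of map(1:5, get_quotes_scroll).
# The field names (text, author$name) are an assumption about the API.
pages <- list(
  list(
    list(text = "Quote A", author = list(name = "Author One")),
    list(text = "Quote B", author = list(name = "Author Two"))
  ),
  list(
    list(text = "Quote C", author = list(name = "Author Three"))
  )
)

flatten_quotes <- function(pages) {
  # Drop the page level so every element is a single quote record.
  records <- unlist(pages, recursive = FALSE)
  tibble(
    quote  = map_chr(records, ~ .x$text),
    author = map_chr(records, ~ .x$author$name)
  )
}

quotes_tbl <- flatten_quotes(pages)
```

If the real records match that shape, flatten_quotes(quotes) should give one row per quote; inspect str(quotes[[1]][[1]]) first to confirm the field names.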
http://quotes.toscrape.com/ provides quotes for scraping. This scraping sandbox offers several endpoints, each presenting a different scraping challenge. Try to scrape the first 50 quotes and authors at http://quotes.toscrape.com/js/.
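First, try using the typical approach with rvest to understand what is going on. The quotes on this page are rendered by JavaScript, so the static HTML that rvest reads contains no quote elements.
Next, set up an RSelenium driver so the page can be rendered in a real browser.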
# chromever must match the version of Chrome installed on your machine.
driver <- rsDriver(browser = "chrome",
                   chromever = "85.0.4183.87", port = 5112L)
remote_driver <- driver$client
Navigate to page 1 to test that our driver works.
quotes_js_url <- "http://quotes.toscrape.com/js/"
remote_driver$navigate(quotes_js_url)
Scrape the quotes and authors on the page. The function getPageSource() returns a list with the rendered HTML document as its first component.
quotes_html <- read_html(remote_driver$getPageSource()[[1]])
quotes <- quotes_html %>%
  html_nodes(".text") %>%
  html_text()
authors <- quotes_html %>%
  html_nodes(".author") %>%
  html_text()
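Selecting quotes and authors in two separate passes relies on both vectors coming back in the same order and with the same length. A slightly more defensive variant, sketched below, selects each quote container first and then pulls the text and author from inside it. The HTML is a hypothetical stand-in for a rendered page, and the div.quote wrapper is an assumption about the markup; check it in the browser's inspector before relying on it.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for a rendered page. The .text and .author classes
# match the selectors used above; the .quote wrapper is an assumption.
rendered_html <- '
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <small class="author">Author One</small>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <small class="author">Author Two</small>
  </div>
</body></html>'

quote_nodes <- read_html(rendered_html) %>%
  html_nodes(".quote")

# html_node() (singular) returns exactly one result per container,
# so each quote stays paired with its own author.
paired <- tibble(
  quote  = quote_nodes %>% html_node(".text") %>% html_text(),
  author = quote_nodes %>% html_node(".author") %>% html_text()
)
```

With this layout, a container missing its author yields NA in that row instead of shifting every later author up by one.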
Turn the above code into a function for iterative page scraping.
get_quotes <- function(page) {
  # Build the URL for the requested page and load it in the browser.
  base_url <- quotes_js_url
  url <- str_c(base_url, "page/", page, "/")
  remote_driver$navigate(url)
  # Parse the rendered page source rather than the static HTML.
  quotes_html <- read_html(remote_driver$getPageSource()[[1]])
  quotes <- quotes_html %>%
    html_nodes(".text") %>%
    html_text()
  authors <- quotes_html %>%
    html_nodes(".author") %>%
    html_text()
  tibble(
    quote = quotes,
    author = authors
  )
}
Scrape pages 1 to 5 and combine the results into a single tibble.
map_df(1:5, get_quotes)
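When looping over many pages, it can help to pause between requests and to keep going if one page fails. The sketch below is one possible wrapper, not part of the original exercise: scrape_pages() and fake_get_quotes() are hypothetical names, and the fake scraper stands in for get_quotes() so the example runs without a browser.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical stand-in for get_quotes(); it fails on page 3 so the
# error handling can be seen without a running Selenium session.
fake_get_quotes <- function(page) {
  if (page == 3) stop("page failed to load")
  tibble(quote = paste("quote from page", page), author = "someone")
}

scrape_pages <- function(pages, scrape_fn, delay = 0.5) {
  # possibly() turns an error into NULL instead of aborting the loop.
  safe_fn <- possibly(scrape_fn, otherwise = NULL)
  results <- map(pages, function(p) {
    Sys.sleep(delay)  # pause before each request
    safe_fn(p)
  })
  # bind_rows() ignores the NULL entries left by failed pages.
  bind_rows(results)
}

all_quotes <- scrape_pages(1:5, fake_get_quotes, delay = 0)
```

In the real exercise the call would be scrape_pages(1:5, get_quotes); comparing nrow() of the result with the expected 50 is a quick way to spot a page that was skipped.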
Close the browser session and stop the Selenium server.
remote_driver$close()
driver$server$stop()
rm(driver)
rm(remote_driver)