Packages

library(tidyverse)
library(rvest)

Exercise 1

Problem

Go to http://books.toscrape.com/catalogue/page-1.html and scrape the first five pages of data on books with regards to their

title
price
star rating
availability

Organize your results in a neatly formatted tibble similar to below.

Solution

get_books <- function(page) {
  
  base_url <- "http://books.toscrape.com/catalogue/page-"
  url <- str_c(base_url, page, ".html")
  
  books_html <- read_html(url)
  
  prices <- books_html %>% 
    html_nodes(css = ".price_color") %>% 
    html_text()
  
  titles <- books_html %>% 
    html_nodes(css = ".product_pod a") %>% 
    html_attr("title") %>% 
    .[!is.na(.)]

  ratings <- books_html %>% 
    html_nodes(css = ".star-rating") %>% 
    html_attr(name = "class") %>% 
    str_remove(pattern = "star-rating ")
  
  availabilities <- books_html %>% 
    html_nodes(css = ".availability") %>% 
    html_text() %>% 
    str_trim()
  
  books_df <- tibble(
    title     = titles,
    price     = prices,
    rating    = ratings,
    available = availabilities
  )
  
  return(books_df)
}

# iterate across pages
pages <- 1:5
books <- map_df(pages, get_books)
books

Exercise 2

Problem

Scrape the first page of books from each genre in the side bar on the website http://books.toscrape.com/. Scape the same information as before and include the genre.

Modify our function from Exercise 1 to scrape the books by genre.

Solution

First, grab all the genre URLs.

base_url <- "http://books.toscrape.com/"
books_main_html <- read_html("http://books.toscrape.com/index.html")

genre_urls <- books_main_html %>% 
  html_nodes(".nav-list ul a") %>%
  html_attr(name = "href") %>% 
  str_c(base_url, .)

genres <- books_main_html %>% 
  html_nodes(".nav-list ul a") %>%
  html_text() %>% 
  str_trim()

Modify our function from Exercise 1 to scrape the books by genre.

get_books_by_genre <- function(url, genre) {
  
  books_html <- read_html(url)
  
  prices <- books_html %>% 
    html_nodes(css = ".price_color") %>% 
    html_text()
  
  titles <- books_html %>% 
    html_nodes(css = ".product_pod a") %>% 
    html_attr("title") %>% 
    .[!is.na(.)]

  ratings <- books_html %>% 
    html_nodes(css = ".star-rating") %>% 
    html_attr(name = "class") %>% 
    str_remove(pattern = "star-rating ")
  
  availabilities <- books_html %>% 
    html_nodes(css = ".availability") %>% 
    html_text() %>% 
    str_trim()
  
  books_df <- tibble(
    title     = titles,
    price     = prices,
    rating    = ratings,
    available = availabilities,
    genre     = genre
  )
  
  return(books_df)
}

map2_df(genre_urls, genres, get_books_by_genre)

Exercise 3

Problem

HTML tags are composed of three things: an opening tag, content and ending tag. They each have different properties. Identify what the following tags are used for. Only the opening tag is shown.

Tag	Description
`<b>`
`<i>`
`<h3>`
`<table>`
`<tr>`
`<th>`
`<td>`
`<img>`
`<p>`

Solution

HTML tags are composed of three things: an opening tag, content and ending tag. They each have different properties. Identify what the following tags are used for. Only the opening tag is shown.

Tag	Description
`<b>`	bold
`<i>`	italics
`<h3>`	level 3 header
`<table>`	table
`<tr>`	row in a table
`<th>`	header in a table
`<td>`	cell in a table
`<img>`	image
`<p>`	paragraph

Exercises: Web scraping part I

Shawn Santo

Packages

Exercise 1

Problem

Solution

Exercise 2

Problem

Solution

Exercise 3

Problem

Solution