library(tidyverse)
library(rvest)
Go to http://books.toscrape.com/catalogue/page-1.html and scrape the first five pages of data on books with regards to their
Organize your results in a neatly formatted tibble similar to below.
get_books <- function(page) {
base_url <- "http://books.toscrape.com/catalogue/page-"
url <- str_c(base_url, page, ".html")
books_html <- read_html(url)
prices <- books_html %>%
html_nodes(css = ".price_color") %>%
html_text()
titles <- books_html %>%
html_nodes(css = ".product_pod a") %>%
html_attr("title") %>%
.[!is.na(.)]
ratings <- books_html %>%
html_nodes(css = ".star-rating") %>%
html_attr(name = "class") %>%
str_remove(pattern = "star-rating ")
availabilities <- books_html %>%
html_nodes(css = ".availability") %>%
html_text() %>%
str_trim()
books_df <- tibble(
title = titles,
price = prices,
rating = ratings,
available = availabilities
)
return(books_df)
}
# iterate across pages
pages <- 1:5
books <- map_df(pages, get_books)
books
Scrape the first page of books from each genre in the side bar on the website http://books.toscrape.com/. Scape the same information as before and include the genre.
Modify our function from Exercise 1 to scrape the books by genre.
First, grab all the genre URLs.
base_url <- "http://books.toscrape.com/"
books_main_html <- read_html("http://books.toscrape.com/index.html")
genre_urls <- books_main_html %>%
html_nodes(".nav-list ul a") %>%
html_attr(name = "href") %>%
str_c(base_url, .)
genres <- books_main_html %>%
html_nodes(".nav-list ul a") %>%
html_text() %>%
str_trim()
Modify our function from Exercise 1 to scrape the books by genre.
get_books_by_genre <- function(url, genre) {
books_html <- read_html(url)
prices <- books_html %>%
html_nodes(css = ".price_color") %>%
html_text()
titles <- books_html %>%
html_nodes(css = ".product_pod a") %>%
html_attr("title") %>%
.[!is.na(.)]
ratings <- books_html %>%
html_nodes(css = ".star-rating") %>%
html_attr(name = "class") %>%
str_remove(pattern = "star-rating ")
availabilities <- books_html %>%
html_nodes(css = ".availability") %>%
html_text() %>%
str_trim()
books_df <- tibble(
title = titles,
price = prices,
rating = ratings,
available = availabilities,
genre = genre
)
return(books_df)
}
map2_df(genre_urls, genres, get_books_by_genre)
HTML tags are composed of three things: an opening tag, content and ending tag. They each have different properties. Identify what the following tags are used for. Only the opening tag is shown.
Tag | Description |
---|---|
<b> |
|
<i> |
|
<h3> |
|
<table> |
|
<tr> |
|
<th> |
|
<td> |
|
<img> |
|
<p> |
HTML tags are composed of three things: an opening tag, content and ending tag. They each have different properties. Identify what the following tags are used for. Only the opening tag is shown.
Tag | Description |
---|---|
<b> |
bold |
<i> |
italics |
<h3> |
level 3 header |
<table> |
table |
<tr> |
row in a table |
<th> |
header in a table |
<td> |
cell in a table |
<img> |
image |
<p> |
paragraph |