---
title: "Web Scraping"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-01-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```

## Supplementary materials

Companion videos

- [Recap and introduction to HTML](https://warpwire.duke.edu/w/z88DAA/)
- [Introduction to `rvest`](https://warpwire.duke.edu/w/0c8DAA/)
- [Subsetting HTML documents](https://warpwire.duke.edu/w/088DAA/)
- [Live demo with `rvest`](https://warpwire.duke.edu/w/1c8DAA/)

Additional resources

- [SelectorGadget Vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
- `rvest` [website](https://rvest.tidyverse.org/)

---

class: inverse, center, middle

# Recap

---

## Summary of packages

.small-text[

| Task                | Package     | Cheat sheet                                                                  |
|---------------------|-------------|------------------------------------------------------------------------------|
| Visualize data      | `ggplot2`   | https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf |
| Wrangle data frames | `dplyr`     | https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf    |
| Reshape data frames | `tidyr`     | https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf            |
| Iterate             | `purrr`     | https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf                  |
| Text manipulation   | `stringr`   | https://github.com/rstudio/cheatsheets/raw/master/strings.pdf                |
| Manipulate factors  | `forcats`   | https://github.com/rstudio/cheatsheets/raw/master/factors.pdf                |
| Manipulate dates    | `lubridate` | https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf              |

]

You don't need to memorize every function in these packages. Just know where
to look when you come across a specific problem.

---

class: inverse, center, middle

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the
  structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - elements contain tags (a start tag and an end tag)
    - attributes provide additional information about HTML elements
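
As a rough illustration of these three building blocks, the sketch below parses a single made-up `<a>` element in R. The `snippet` string and the `node` variable are hypothetical; the `rvest` functions used are the ones introduced later in these slides.

```r
library(rvest)

# One hypothetical element: start tag <a ...>, content, end tag </a>.
# href and class are its attributes.
snippet <- '<a href="https://rvest.tidyverse.org/" class="link">rvest website</a>'

# Parse the string and pick out the <a> element
node <- html_node(read_html(snippet), "a")

html_name(node)          # the tag name of the element
html_attrs(node)         # all attributes as a named character vector
html_attr(node, "href")  # the value of one attribute
html_text(node)          # the content between the start and end tags
```

The tag name, the attributes, and the content are three separate pieces of the same element, and each is extracted with its own function.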
---

## Simple HTML document

```html
<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>
    <h1>Using rvest</h1>
    <p>To get started...</p>
  </body>
</html>
```

We can visualize this in a tree-like structure.

---

## HTML tree-like structure

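One way to look at this tree from R is `xml2::html_structure()`, which prints the nesting of the elements. This is a minimal sketch, assuming the simple document is marked up with `<title>`, `<h1>`, and `<p>` tags; the name `simple_doc` is illustrative only.

```r
library(xml2)

# Assumed markup for the simple document (title, heading, paragraph)
simple_doc <- read_html(
  "<html>
     <head><title>Web Scraping</title></head>
     <body>
       <h1>Using rvest</h1>
       <p>To get started...</p>
     </body>
   </html>"
)

# Print the element tree, one level of indentation per level of nesting
html_structure(simple_doc)

# The root element and its direct children
xml_name(xml_root(simple_doc))
xml_name(xml_children(simple_doc))
```

Each level of indentation in the printed output corresponds to one level of parent-child nesting in the document.
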
If we have access to an HTML document, then how can we easily extract
information?

---

class: inverse, center, middle

# Package `rvest`

---

## Package `rvest`

`rvest` is a package authored by Hadley Wickham that makes basic processing and
manipulation of HTML data easy.

```{r}
library(rvest)
```

Core functions:

| Function            | Description                                                       |
|---------------------|-------------------------------------------------------------------|
| `xml2::read_html()` | read HTML from a character string or connection                   |
| `html_nodes()`      | select specified nodes from the HTML document using CSS selectors |
| `html_table()`      | parse an HTML table into a data frame                             |
| `html_text()`       | extract tag pairs' content                                        |
| `html_name()`       | extract tags' names                                               |
| `html_attrs()`      | extract all of each tag's attributes                              |
| `html_attr()`       | extract tags' attribute value by name                             |

---

## HTML in R

```{r}
simple_html <- "<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>
    <h1>Using rvest</h1>
    <p>To get started...</p>
  </body>
</html>"
```

--

```{r}
simple_html
```

---

```{r}
html_doc <- read_html(simple_html)
attributes(html_doc)
```

--

```{r}
html_doc
```

---

## CSS selectors

To extract components out of HTML documents use `html_nodes()` and CSS
selectors. In CSS, selectors are patterns used to select elements you want to
style.

We can determine the CSS selectors we need via the point-and-click tool
[SelectorGadget](https://selectorgadget.com/). More on this in a moment.

.small-text[

Selector          | Example          | Description
:-----------------|:-----------------|:--------------------------------------------------
element           | `p`              | Select all &lt;p&gt; elements
element element   | `div p`          | Select all &lt;p&gt; elements inside a &lt;div&gt; element
element>element   | `div > p`        | Select all &lt;p&gt; elements with &lt;div&gt; as a parent
.class            | `.title`         | Select all elements with class="title"
#id               | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"

]

For more CSS selector references click
[here](https://www.w3schools.com/cssref/css_selectors.asp).

???

- CSS stands for Cascading Style Sheets.
- CSS describes how HTML elements are to be displayed on screen, paper, or in
  other media.
- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a