class: center, middle, inverse, title-slide

# Web Scraping Part II

## Statistical Programming

### Adapted from STA 523, Professor Shawn Santo

### 11-07-19

---

class: inverse, center, middle

# Recall

---

## Hypertext Markup Language

- A markup language is a computer language that uses tags to define elements within a document. It contains standard words rather than typical programming syntax. The two most popular markup languages are HTML and XML.

- HTML describes the structure of a web page; your browser interprets the structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - An element is a component of an HTML document.
    - Elements are generally wrapped in tags (a start tag and an end tag).
    - Attributes provide additional information about HTML elements.

<center>
<img src="https://shawnsanto.com/files/sta523/slides/images/html-structure.png" height="300" width="450">
</center>

---

## SelectorGadget

[CSS](https://www.w3schools.com/whatis/whatis_css.asp) stands for "Cascading Style Sheets". In CSS, selectors are patterns used to select the element(s) you want to style.

[SelectorGadget](https://selectorgadget.com/) makes identifying the CSS selector you need as easy as clicking on items on a webpage.

<center>
<iframe title="vimeo-player" src="https://player.vimeo.com/video/52055686" width="600" height="400" frameborder="0" allowfullscreen></iframe>
</center>

---

## Web scraping workflow

1. Understand the website's hierarchy and what information you need.

--

2. Use SelectorGadget to identify relevant CSS selectors.

--

3. Read the HTML by passing a URL, then subset the resulting HTML document using CSS selectors.

    ```r
    read_html(url) %>%
      html_nodes(css = "specified_css_selector")
    ```

--

4. Further extract attributes, text, or tags by adding another layer with

    ```r
    read_html(url) %>%
      html_nodes(css = "specified_css_selector") %>%
      html_*()
    ```

    where `*` is `text`, `attr`, `attrs`, `name`, or `table`.
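---

## Workflow in action

The steps above can be sketched end-to-end. To keep the example self-contained (and runnable without a network connection), it parses an inline HTML snippet instead of a live URL — `read_html()` also accepts a literal HTML string. The `.store` and `.name` class names are made up for illustration.

```r
library(rvest)

# Stand-in for read_html(url): a tiny HTML document supplied as a string
page <- read_html('
  <div class="store"><span class="name">Stop A</span></div>
  <div class="store"><span class="name">Stop B</span></div>
')

page %>%
  html_nodes(css = ".store .name") %>%  # step 3: subset via a CSS selector
  html_text()                           # step 4: extract the text
#> [1] "Stop A" "Stop B"
```

On a real site, swap the string for the page URL and replace `.store .name` with the selector SelectorGadget reports.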
---

## Overview

*Source*: https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md

---

class: inverse, center, middle

# More on `rvest`

---

## Limitations of using `rvest`

- It is difficult to make your code reproducible long term. When a website or its HTML changes, your code may stop working.
    - CSS selectors change
    - Contents are moved
    - Pages switch from HTML to JavaScript

<br><br>

--

- Websites that rely heavily on JavaScript:
    - https://x.company/
    - http://www.visithumboldt.com/
    - https://mtcubacenter.org/

---

## What is JavaScript?

- Scripting language for building interactive web pages

- Basis for web, mobile, and network applications

- Every browser has a JavaScript engine that can execute JavaScript code.
    - Chrome: V8

<br>

--

If `read_html()` is meant for HTML, what can we do?

---

## Possible solutions

1. Execute JavaScript in R

2. Use Chrome's developer tools

3. Use other packages
    - [RSelenium](http://ropensci.github.io/RSelenium/)
    - [PhantomJS](https://phantomjs.org/download.html)

<br>

We'll focus on the second option...

--

>In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere. <br><br> It usually means that you won't be making an HTTP request to the page's URL that you see at the top of your browser window, but instead you'll need to find the URL of the AJAX request that's going on in the background to fetch the data from the server and load it into the page.<br><br> --- Hartley Brody

---

class: inverse, center, middle

# Live demo

---

class: inverse, center, middle

# Web scraping considerations

---

## Best practices

- Abide by a site's terms and conditions.

--

- Respect robots.txt.
    - https://www.facebook.com/robots.txt
    - https://www.wegmans.com/robots.txt
    - https://www.google.com/robots.txt

- Carefully read the terms and conditions.

--

- Cache your `read_html()` chunks. Isolate these chunks.
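In an R Markdown document, the simplest route is to put the download in its own chunk with `cache=TRUE`. Outside knitr, a disk cache achieves the same thing. The sketch below assumes a hypothetical URL and file name; only the fetch-once-then-reuse pattern is the point.

```r
library(rvest)

# Illustrative disk cache: download the page once, then re-read locally.
cache_file <- "page_cache.html"

if (!file.exists(cache_file)) {
  # Placeholder URL -- substitute the page you are actually scraping
  download.file("https://example.com/", destfile = cache_file, quiet = TRUE)
}

page <- read_html(cache_file)  # read_html() also accepts a local file path
```

Re-running the script now re-parses the saved copy instead of hitting the server again.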
--

- Avoid using `read_html()` in code that is iterated.

--

- Do not overload the server at peak hours.
    - Implement delayed crawls: `Sys.sleep(rexp(1) + 4)`

--

- If available, use a site's API.

--

- Do not violate any copyright laws.

---

## To avoid a ban

- Follow the best practices on the previous slide.

<br><br><br>

--

- Disguise your IP address.
    - `httr::use_proxy()`

<br><br><br>

--

- Avoid scraping behind pages protected by log-in, unless it is permitted by the site.
    - `html_session()`

<br><br><br>

--

- Watch out for honey pot traps - links invisible to normal visitors, but present in the HTML and followed by web scrapers.

---

## Exercise

#### Sheetz vs Wawa

1. Go to https://www.wawa.com. Scrape all the stores with fuel available in a 20 mile radius of zip code 18104. What is the average price of fuel per gallon for each grade?<br><br> JSON data can be read in with `jsonlite::read_json()`.

2. Go to http://www2.stat.duke.edu/~sms185/data/fuel/bystore/zteehs/regions.html. Scrape the Sheetz data using the provided URL. You will need to navigate through the website to obtain all the data. Space out your requests when downloading the data.

---

## References

- https://selectorgadget.com/
- https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md
- https://techterms.com/definition/markup_language