Lab 10 - Say cheers to web scraping!

2018-04-12

Due: 2018-04-19 at noon

In 2017, the U.S. beer industry shipped (sold) 207.4 million barrels of beer, equivalent to more than 2.9 billion cases of 24 twelve-ounce servings. The industry also shipped approximately 2 million barrels of cider, equivalent to more than 28.3 million cases. In addition, the U.S. beer industry sells more than $111.1 billion in beer and malt-based beverages to U.S. consumers each year. (Source: The U.S. Beer Industry)

Do you know how many breweries are around you? And how much beer they brew? In this lab we scrape and analyze data on US breweries.

We will start by getting data on breweries in North Carolina. Then, you will use a similar approach to get data on breweries in a different state of your choosing.

In order to complete this lab you will need a Chrome browser with the SelectorGadget extension installed.

By now you should be familiar with instructions for getting started and setting up your git configuration. If not, you can refer to one of the earlier labs.

Packages

In this lab we will work with the tidyverse, rvest, and robotstxt packages. These packages should already be installed in your project, and you can load them with the following:

library(tidyverse) 
library(rvest)
library(robotstxt)

The data

We will scrape brewery information from https://www.ratebeer.com/breweries/. RateBeer.com is an in-depth, consumer-driven source of beer information. We will use the state-level brewery lists on this site to first obtain information on all breweries in a given state. Then, we will dive deeper and obtain additional information on each of the breweries in that state, one-by-one, by automating our code to do so.

Before getting started, let’s check that a bot has permission to access pages on this domain.

paths_allowed("https://www.ratebeer.com/")
##  www.ratebeer.com
## [1] TRUE

North Carolina

The goal of this exercise is to scrape the data from the North Carolina breweries page on RateBeer and save it as a data frame, which you will then write out to ncbreweries.csv.

  1. Based on the information on the North Carolina breweries page, how many total (active + closed) breweries are there in NC?

The code for the following should go in the script.R file.
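
As a starting point, here is a minimal sketch of reading the page into R with rvest and pulling out the brewery names. The URL and the CSS selector are assumptions: copy the actual address of the North Carolina breweries list from your browser, and use SelectorGadget to find the selector that matches the brewery names on the current version of the page.

# Assumed URL -- verify it against the page in your browser
nc_url  <- "https://www.ratebeer.com/breweries/north%20carolina/33/213/"
nc_page <- read_html(nc_url)

# Brewery names; the CSS selector is a placeholder found with SelectorGadget
brewery_names <- nc_page %>%
  html_nodes("#brewerTable a strong") %>%
  html_text()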

Next we get the cities. The CSS paths for the cities differ between the active and the closed brewery listings.
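
For example, assuming the nc_page object from the sketch above, the two city vectors might be built as follows. Both selectors are placeholders; replace them with the ones SelectorGadget reports for the active and the closed listings.

# Cities of active breweries (placeholder selector)
active_cities <- nc_page %>%
  html_nodes("#brewerTable .filter a") %>%
  html_text()

# Cities of closed breweries (placeholder selector)
closed_cities <- nc_page %>%
  html_nodes("#closedBrewerTable .filter a") %>%
  html_text()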

# Combine the active and closed city vectors into one
cities <- c(active_cities, closed_cities)
  1. Load ncbreweries.csv in your Rmd.

  2. There is at least one error in the data: Edenton Brewing Company appears to have been opened in 1900, but this is not true. Find out when this brewery was opened, and correct the data.

  3. Which city in NC has the most breweries? How many breweries are in Durham, NC? What are they? (A short dplyr sketch follows this list as one way to get started.)

  4. Recreate the following visualization, and interpret it.
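
For the first and third questions, one possible starting point is sketched below. It assumes your csv has a column named city; adjust the names to match whatever your scraped data frame uses.

nc_breweries <- read_csv("ncbreweries.csv")

# Breweries per city, largest count first
nc_breweries %>%
  count(city, sort = TRUE)

# Breweries located in Durham
nc_breweries %>%
  filter(city == "Durham")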

Choose your own state

Do the following in the script.R file.

Repeat what you did above (potentially with some modifications) to create a similar data frame (with the same columns) for a different state of your choosing. Save the result in a csv file with the name of the state you chose, e.g. cabreweries.csv.

Load the new data in the R Markdown.
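
A sketch of the save/load round trip, using California purely as an example; the data frame name state_breweries is a placeholder for whatever you call your scraped result.

# In script.R: write the scraped data frame to a csv named after your state
write_csv(state_breweries, "cabreweries.csv")

# In the Rmd: read it back in
ca_breweries <- read_csv("cabreweries.csv")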

  1. Determine which city in that state has the highest number of breweries.

  2. Determine which city has the youngest breweries, on average.

  3. Write a function to grab the zip code for each of the breweries from their own pages (at the URL you recorded). Test this function out on the first three breweries.
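
Below is a sketch of one way such a function could be structured. The CSS selector for the address block is a placeholder (find the real one with SelectorGadget), the zip code is taken to be the last five-digit run in the address text, and the column name url is an assumption about your data frame.

get_zip <- function(brewery_url) {
  page <- read_html(brewery_url)
  # Collapse the address block into a single string (placeholder selector)
  address <- page %>%
    html_nodes(".address") %>%
    html_text() %>%
    paste(collapse = " ")
  # Keep the last run of five digits, or NA if none is found
  zips <- str_extract_all(address, "\\d{5}")[[1]]
  if (length(zips) == 0) NA_character_ else tail(zips, 1)
}

# Test on the first three breweries (assumes a url column with full page addresses)
map_chr(state_breweries$url[1:3], get_zip)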