Due: 2018-04-19 at noon
In 2017, the U.S. beer industry shipped (sold) 207.4 million barrels of beer, equivalent to more than 2.9 billion cases of 24 12-ounce servings. The industry also shipped approximately 2 million barrels of cider, equivalent to more than 28.3 million cases. In total, the U.S. beer industry sells more than $111.1 billion in beer and malt-based beverages to U.S. consumers each year. (Source: The U.S. Beer Industry)
Do you know how many breweries are around you? And how much beer they brew? In this lab we scrape and analyze data on US breweries.
We will start by getting data on breweries in North Carolina. Then, you will use a similar approach to get data on breweries in a different state of your choosing.
In order to complete this lab you will need a Chrome browser with the Selector Gadget extension installed.
By now you should be familiar with instructions for getting started and setting up your git configuration. If not, you can refer to one of the earlier labs.
In this lab we will work with the tidyverse, rvest, and robotstxt packages. These packages should already be installed in your project, and you can load them with the following:
library(tidyverse)
library(rvest)
library(robotstxt)
We will scrape brewery information from https://www.ratebeer.com/breweries/. RateBeer.com is an in-depth, consumer-driven source of beer information. We will use the state-level brewery lists on this site to first obtain information on all breweries in a given state. Then, we will dive deeper and obtain additional information on each of the breweries in that state, one-by-one, by automating our code to do so.
Before getting started, let’s check that a bot has permissions to access pages on this domain.
paths_allowed("https://www.ratebeer.com/")
## www.ratebeer.com
## [1] TRUE
The goal of this exercise is to scrape the data from the RateBeer brewery list for North Carolina and save it as the following data frame.
The code for the following should go in the script.R file.
Next we get the cities. The paths for the cities on the active vs. closed pages are different, so create two vectors, one called active_cities and the other closed_cities, that contain the cities of the active and closed breweries, respectively. Then, combine these vectors with cities <- c(active_cities, closed_cities).
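As a rough sketch of the pattern described above, the following extracts two sets of cities from a page and combines them. The HTML snippet and the CSS selectors (#active .city and #closed .city) are made up for illustration; use SelectorGadget to find the real paths on the RateBeer page.

```r
library(rvest)

# Stand-in for the downloaded page. The structure and selectors here are
# hypothetical; the real RateBeer page will look different.
page <- read_html('
  <html><body>
    <table id="active">
      <tr><td class="city">Durham</td></tr>
      <tr><td class="city">Asheville</td></tr>
    </table>
    <table id="closed">
      <tr><td class="city">Edenton</td></tr>
    </table>
  </body></html>
')

# One selector per table, since the paths differ between active and closed
active_cities <- page %>% html_nodes("#active .city") %>% html_text()
closed_cities <- page %>% html_nodes("#closed .city") %>% html_text()

# Combine in a fixed order: active first, then closed
cities <- c(active_cities, closed_cities)
```

Keeping the active-then-closed order consistent across every vector you scrape matters, because the vectors are later lined up row by row in a data frame.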
Scrape the brewery type, the number of beers brewed at each brewery, the year when each brewery first opened (“est.”), the URL of each brewery's page, and its status (active or closed). Save these as vectors called types, counts, ests, urls, and status, respectively.
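Scraped values arrive as text, so counts and years usually need cleaning before they can be used as numbers. The sketch below assumes, purely for illustration, that the raw strings look like the made-up examples shown and that the first two breweries are active and the third is closed; check what your own scraped text looks like before reusing it.

```r
library(stringr)

# Made-up raw strings standing in for scraped text
raw_counts <- c("12", " 45 ", "3")
raw_ests   <- c("est. 2012", "est. 1998", "est. 2015")

# Trim whitespace, then convert to integers
counts <- as.integer(str_trim(raw_counts))

# Pull out the four-digit year and convert it
ests <- as.integer(str_extract(raw_ests, "\\d{4}"))

# Status is not necessarily on the page itself; one option is to build it
# from the number of active and closed breweries (hypothetical counts here)
n_active <- 2
n_closed <- 1
status <- c(rep("Active", n_active), rep("Closed", n_closed))
```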
Create a data frame (tibble) called ncbreweries with column names shown in the table above.
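Once all vectors have the same length and order, they can be combined with tibble(). The sketch below uses placeholder values and placeholder column names (name, city, type, beer_count, est, status, url); use the column names from the table in the lab instead.

```r
library(tibble)

# Placeholder vectors standing in for the scraped ones
brewery_names <- c("Brewery A", "Brewery B", "Brewery C")
cities <- c("Durham", "Asheville", "Edenton")
types  <- c("Microbrewery", "Brewpub", "Microbrewery")
counts <- c(12L, 45L, 3L)
ests   <- c(2012L, 1998L, 2015L)
status <- c("Active", "Active", "Closed")
urls   <- c("/a", "/b", "/c")

# Column names here are hypothetical
breweries_sketch <- tibble(
  name       = brewery_names,
  city       = cities,
  type       = types,
  beer_count = counts,
  est        = ests,
  status     = status,
  url        = urls
)
```

tibble() throws an error if the vectors have different lengths, which is a useful check that nothing was dropped during scraping.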
Save this data frame as ncbreweries.csv in the data/ folder using the write_csv function: write_csv(ncbreweries, "data/ncbreweries.csv").
Load ncbreweries.csv in your Rmd.
There is at least one error in the data: Edenton Brewing Company appears to have been opened in 1900, but this is not true. Find out when this brewery was opened, and correct the data.
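One way to correct a single value is a conditional mutate(). The sketch below runs on a toy data frame with assumed column names (name, est), and corrected_year is a placeholder left as NA: replace it with the year you find.

```r
library(dplyr)

# Toy data; column names are assumptions
toy <- tibble(
  name = c("Edenton Brewing Company", "Brewery B"),
  est  = c(1900, 2010)
)

# Placeholder: fill in the year you look up
corrected_year <- NA_real_

# Change est only for the one brewery, leave every other row as is
fixed <- toy %>%
  mutate(est = if_else(name == "Edenton Brewing Company", corrected_year, est))
```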
Which city in NC has the most breweries? How many breweries are in Durham, NC? What are they?
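Questions like these usually come down to count() and filter(). Here is a minimal sketch on toy data, assuming columns called name and city; your column names may differ.

```r
library(dplyr)

# Toy data standing in for ncbreweries
toy <- tibble(
  name = c("Brewery A", "Brewery B", "Brewery C", "Brewery D"),
  city = c("Durham", "Durham", "Raleigh", "Asheville")
)

# Number of breweries per city, largest first
city_counts <- toy %>%
  count(city, sort = TRUE)

# All breweries in one city
durham_breweries <- toy %>%
  filter(city == "Durham")
```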
Recreate the following visualization, and interpret it.
Do the following in the script.R file.
Repeat what you did above (potentially with some modifications) to create a similar data frame (with the same columns) for a different state of your choosing. Save the result in a CSV file named after the state you chose, e.g. cabreweries.csv.
Load the new data in the R Markdown.
Determine which city in that state has the highest number of breweries.
Determine which city has the youngest breweries, on average.
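"Youngest on average" can be read as the city whose breweries have the latest mean opening year. A sketch of that interpretation on toy data, assuming columns called city and est:

```r
library(dplyr)

# Toy data; values are made up
toy <- tibble(
  city = c("City X", "City X", "City Y"),
  est  = c(2000, 2010, 2015)
)

# Mean opening year per city, most recent first
youngest <- toy %>%
  group_by(city) %>%
  summarise(mean_est = mean(est, na.rm = TRUE)) %>%
  arrange(desc(mean_est))
```

Note that a city with a single, recently opened brewery will rank at the top; you may want to report the number of breweries alongside the mean.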
Write a function to grab the zip code for each of the breweries from their own pages (at the URL you recorded). Test this function out on the first three breweries.
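One possible shape for such a function is sketched below. It splits the work in two: extract_zip() parses a page that has already been read, and get_zip() reads the page at a URL first. Both function names, the .address selector, and the demo HTML are hypothetical, and the sketch assumes the first standalone five-digit number in the address is the zip code.

```r
library(rvest)
library(stringr)

# Parse a zip code out of an already-read page.
# The ".address" selector is an assumption; find the real one with SelectorGadget.
extract_zip <- function(page) {
  address <- page %>%
    html_node(".address") %>%
    html_text()
  # First standalone five-digit number in the address text
  str_extract(address, "\\b\\d{5}\\b")
}

# Read the page at a URL, then parse it
get_zip <- function(url) {
  extract_zip(read_html(url))
}

# Demo on a made-up page, so no network access is needed
demo_page <- read_html(
  '<html><body><div class="address">123 Main St, Durham, NC 27701</div></body></html>'
)
demo_zip <- extract_zip(demo_page)

# On real data this might look like: purrr::map_chr(urls[1:3], get_zip)
```

Separating parsing from downloading makes it easier to test the function on a saved page before running it against many URLs.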