Every election cycle brings its own brand of excitement – and lots of money. Political donations are of particular interest to political scientists and other researchers studying politics and voting patterns. They are also of interest to citizens who want to stay informed of how much money their candidates raise and where that money comes from.
Do you know how much money the candidates running for House or Senate races in your state have raised and spent? And how they compare to candidates from other districts or states? In this assignment we scrape and analyze data that will help us answer these questions.
First, we will get data on candidates in North Carolina. Then, you will use a similar approach to get data on candidates in a different state of your choosing.
In order to complete this assignment you will need a Chrome browser with the Selector Gadget extension installed.
By now you should be familiar with instructions for getting started and setting up your git configuration. If not, you can refer to one of the earlier assignments.
In this assignment we will work with the tidyverse, rvest, robotstxt, and lubridate packages. These packages should already be installed in your project, and you can load them with the following:
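A minimal sketch of that loading step, assuming all four packages are installed in the project library:

```r
library(tidyverse)   # dplyr, stringr, readr, purrr, tibble, ...
library(rvest)       # read_html(), html_elements(), html_text(), html_attr()
library(robotstxt)   # paths_allowed()
library(lubridate)   # mdy()
```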
We will scrape and work with data on campaign funds. The data come from OpenSecrets.org, a “website tracking the influence of money on U.S. politics, and how that money affects policy and citizens’ lives”. This website is hosted by The Center for Responsive Politics, which is a nonpartisan, independent nonprofit that “tracks money in U.S. politics and its effect on elections and public policy.” (Source: https://www.opensecrets.org/about/)
Before getting started, let’s check that a bot has permissions to access pages on this domain.
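One way to run this check is paths_allowed() from the robotstxt package. The sketch below needs an internet connection, and the answer depends on the site's current robots.txt file, so it may not always match the output recorded in this assignment.

```r
library(robotstxt)

# Ask whether a generic bot may access the site root, based on robots.txt.
# Requires network access; the result reflects the live robots.txt file.
paths_allowed("https://www.opensecrets.org")
```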
## [1] TRUE
The goal of this exercise is to scrape the data from a page that looks like the page shown on the right, and save it as a data frame that looks like the data frame shown below.
The code for the following should go in the scrape-nc.R file.
race: Let’s start with the names of the races. Use the selector gadget to determine the path to the names of races, and create a character vector called race, containing the names of the races from the first column of the table shown in the screenshot above.
race_link: Next, grab the URLs of the races, and save them as a character vector called race_link. Note that you need the full URL (as shown in the table above), not just the relative link.
Then, create a data frame (a tibble) called nc_races with the variables race and race_link.
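The steps above can be sketched on a small stand-in page. Everything specific here is an assumption made for illustration: the HTML snippet, the relative links, and the "td a" selector are hypothetical, and the real page will need the selector you find with the selector gadget.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for the races table; the real page differs.
page <- minimal_html('
  <table>
    <tr><td><a href="/races/summary?id=NC01">North Carolina District 01</a></td></tr>
    <tr><td><a href="/races/summary?id=NC02">North Carolina District 02</a></td></tr>
  </table>')

# Placeholder selector; replace it with the one found via the selector gadget
links <- html_elements(page, "td a")

# Text of each link is the race name
race <- html_text(links)

# html_attr() returns the relative link, so prepend the domain to get the full URL
race_link <- paste0("https://www.opensecrets.org", html_attr(links, "href"))

nc_races <- tibble(race, race_link)
```

The same pattern (select elements, then pull either their text or an attribute) applies to most of the scraping steps in this assignment.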
There are no senate elections in 2018 in North Carolina, but some of the other states will have senate elections. In order for your code to accurately capture data from other states as well, we’ll add an indicator for the type of race: House or Senate.
Mutate nc_races to add a new variable, race_type, that detects if the word Senate appears in the value of the race variable for a given race. If yes, the value of race_type is set to "Senate", and if no, it’s set to "House". Two functions will help you with this task: ifelse() and str_detect().
Write nc_races to a file called nc_races.csv in the data/ folder using the write_csv() function.
Load nc_races.csv in your R Markdown file and save it as nc_races.
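A minimal sketch of the ifelse() and str_detect() combination on toy data. The second row is made up so that both outcomes appear; it is not a real race name from the site.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data; the "Senate" row is invented to show both branches.
races <- tibble(race = c("North Carolina District 01", "Example State Senate"))

races <- races %>%
  mutate(race_type = ifelse(str_detect(race, "Senate"), "Senate", "House"))
```

str_detect() returns TRUE or FALSE for each row, and ifelse() turns that logical vector into the two labels.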
How many rows and how many variables does nc_races have? Does this match up with your expectation?
Next we scrape data on candidates in the first district of North Carolina.
The code for the following should go in the scrape-nc.R file.
Hint: The URL for this page is already saved in nc_races$race_link[1].
race_page: Read the page for the first district and save the result as race_page.
candidate_info: Scrape the candidate info and save as a character vector called candidate_info. It will look something like the following.
"G K Butterfield (D)\n • Incumbent" "Roger Allison (R)"
Hint: You’ll need to deal with the dollar signs ($) and commas (,) in the values.
raised, spent, cash_on_hand: Scrape the raised, spent, and cash on hand funds amounts and save them as numeric values.
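One possible way to handle the dollar signs and commas, sketched on made-up strings in the format such a page would typically show:

```r
library(dplyr)     # for the pipe
library(stringr)

# Example strings; the amounts are made up for illustration.
raised_raw <- c("$1,234,567", "$8,900")

raised <- raised_raw %>%
  str_remove_all("[$,]") %>%   # drop every dollar sign and comma
  as.numeric()                 # convert the cleaned strings to numbers
```

Inside the square brackets the dollar sign is treated as a literal character, so no escaping is needed there.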
last_report: Scrape the date of the last report and save it as last_report. This variable should have a Date class, formatted as "yyyy-mm-dd", e.g. "2018-06-30". You can use the mdy() function from the lubridate package. This function takes a character vector of dates written in month-day-year order and converts it to a vector with the proper Date class.
## [1] "2018-06-30"
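The conversion that produces a value like the one above can be sketched as follows. The input string is an assumption about how the page writes the date; mdy() accepts several month-day-year spellings.

```r
library(lubridate)

# Assumed raw text as scraped from the page (month/day/year order)
last_report_raw <- "06/30/2018"

last_report <- mdy(last_report_raw)
class(last_report)
```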
race: Scrape the name of the race from the header of the page. Note that this text will say "North Carolina District 01 2018 Race". Remove " 2018 Race" from the character string so that it reads "North Carolina District 01".
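A short sketch of that clean-up step, starting from the header text quoted above:

```r
library(stringr)

race <- "North Carolina District 01 2018 Race"

# Remove the trailing " 2018 Race" (including the leading space)
race <- str_remove(race, " 2018 Race")
```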
candidates: Combine this information in a data frame called candidates with the variables candidate_info, raised, spent, cash_on_hand, last_report, and race.
Lastly, we need to clean up the information in the candidate_info variable.
Hint: You’ll need to escape the parentheses with backslashes.
Create a new variable called party which has the following properties:
- If candidate_info contains the character string (R), label Republican
- If candidate_info contains the character string (D), label Democrat
- Otherwise, label Third party
Create a new variable called status which has the value Incumbent if candidate_info contains the character string Incumbent, otherwise label Challenger.
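The party and status rules can be sketched with case_when() and str_detect(). This is one possible approach rather than the required one, and the third candidate below is made up to exercise the third party case.

```r
library(dplyr)
library(stringr)
library(tibble)

candidates <- tibble(
  candidate_info = c(
    "G K Butterfield (D)\n • Incumbent",
    "Roger Allison (R)",
    "Jane Doe (L)"            # hypothetical third party candidate
  )
)

candidates <- candidates %>%
  mutate(
    # Parentheses are special in regular expressions, so escape them
    party = case_when(
      str_detect(candidate_info, "\\(R\\)") ~ "Republican",
      str_detect(candidate_info, "\\(D\\)") ~ "Democrat",
      TRUE                                  ~ "Third party"
    ),
    status = ifelse(str_detect(candidate_info, "Incumbent"),
                    "Incumbent", "Challenger")
  )
```

Nested ifelse() calls would work as well; case_when() just keeps the three-way rule easier to read.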
At this point take a break to read this short but detailed article on regular expressions. Learning a bit more about regular expressions will give you a lot of facility for working with text data.
Create a new variable called candidate_name which has only the candidate name from the candidate_info variable. This means deleting everything starting with the first open parenthesis (() and trimming the character string. This is going to require getting fancy with regular expressions. To match line breaks, indicated with \n, use "\\n". And to match everything starting with the first (, use "\\((.*)". For example,
candidate_info <- " G K Butterfield (D)\n • Incumbent\n "
# remove all line breaks
candidate_info %>%
  str_remove_all("\\n")
## [1] " G K Butterfield (D) • Incumbent "
# remove all line breaks + remove everything starting with (
candidate_info %>%
  str_remove_all("\\n") %>%
  str_remove("\\((.*)")
## [1] " G K Butterfield "
# remove all line breaks + remove everything starting with ( + trim white spaces
candidate_info %>%
  str_remove_all("\\n") %>%
  str_remove("\\((.*)") %>%
  str_trim()
## [1] "G K Butterfield"
Hint: select(race, everything()) would reorder the columns to move race up to be the first variable, and all the other variables would remain in the same order as before.
Finally reorganize the data frame so that it only contains the variables candidate_name, race, raised, spent, cash_on_hand, last_report, party, and status, in this order. Note that candidate_info is dropped. You will need to use the select() function to reorder the columns. As a requirement for this exercise, you should also use the everything() function, which will allow you to not have to type out all the variable names. See the help for this function to find out more about its usage.
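A sketch of the reordering on a one-row toy data frame whose columns start in the order they were created earlier. The amounts are made up, and dropping candidate_info in a separate select() call is just one way to do it.

```r
library(dplyr)
library(tibble)

# Toy row; amounts are invented for illustration.
candidates <- tibble(
  candidate_info = "G K Butterfield (D)\n • Incumbent",
  raised         = 100,
  spent          = 50,
  cash_on_hand   = 50,
  last_report    = as.Date("2018-06-30"),
  race           = "North Carolina District 01",
  party          = "Democrat",
  status         = "Incumbent",
  candidate_name = "G K Butterfield"
)

candidates <- candidates %>%
  select(-candidate_info) %>%                  # drop the raw text column
  select(candidate_name, race, everything())   # move two columns to the front

names(candidates)
```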
Save the resulting candidates data frame as nc_candidates_dist01.csv in your data/ folder.
Load nc_candidates_dist01.csv in your R Markdown document, save it as nc_candidates_dist01, and print it. Confirm that the data you have matches the one on the webpage.
At this point make sure you’ve committed and pushed your work so far.
In your scrape-nc.R file, copy and paste the code you developed in the previous section for scraping District 1 data to the appropriate section for District 2. Change the input URL to the link for the second district, nc_races$race_link[2].
The goal is to scrape the data for District 2 just like you scraped the data for District 1 in the previous section. You might find that you need to make some adjustments to your code. This is ok, go ahead and make them. But make sure to implement the same changes in your code for scraping District 1 candidates as well. You should get to a point where the same exact code works for both districts.
Write out the resulting data frame as nc_candidates_dist02.csv in your data/ folder.
Load nc_candidates_dist02.csv in your R Markdown document, save it as nc_candidates_dist02, and print it. Confirm that the data you have matches the one on the webpage.
Once again, make sure you’ve committed and pushed your work so far.
Now that you have confirmed your code works on two different pages, we’ll work on turning your code into a function.
In the R script titled scrape-race.R, create a function called scrape_race that has one input, url, and that outputs a tibble with one row per candidate running in that race and six variables.
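The overall shape of such a function might look like the sketch below. It is deliberately incomplete: the ".candidate" and ".raised" selectors are placeholders rather than the real ones, and only two of the variables are built, so treat it as a skeleton to fill in with the code you wrote for Districts 1 and 2.

```r
library(dplyr)
library(stringr)
library(tibble)
library(rvest)

# Skeleton only: the CSS selectors are hypothetical placeholders.
scrape_race <- function(url) {
  # Read the page once and reuse it for every variable
  page <- read_html(url)

  candidate_info <- page %>%
    html_elements(".candidate") %>%
    html_text()

  raised <- page %>%
    html_elements(".raised") %>%
    html_text() %>%
    str_remove_all("[$,]") %>%
    as.numeric()

  # The remaining variables would be scraped and cleaned here in the same way

  tibble(candidate_info, raised)
}
```

Because the function only depends on its url argument, the same call works for any race page that shares the layout.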
Load the function in your R Markdown document using the source() function. This will make the function available for use in the rest of your R Markdown document.
Apply the function to every race link in nc_races and save the result as nc_candidates. Then, glimpse() at the resulting data frame.
If you’re happy with your answer for Exercise 10, I recommend caching the results of the code chunk so that your code for scraping data from all NC races does not have to rerun every time you knit your document. To do this, add the following option to your code chunk: cache = TRUE, e.g. if the label you used for your code chunk is map-all, your code chunk definition will look like this:
{r map-all, cache = TRUE}
Then, knit your document one more time. Your code will run again, but this time a new folder will be created in your project where cached results will be saved. As with everything that changes in your repo, you will need to commit and push this folder as well. Going forward your code will only rerun if the code in the cached code chunk changes.
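The step that applies the function to all races can be sketched with map_dfr() from purrr, which calls a function on each element and row-binds the results. To keep the sketch runnable without network access, a stub replaces the real function and two placeholder strings replace nc_races$race_link; in the assignment the function would instead come from source() with the path to your scrape-race.R file.

```r
library(dplyr)
library(purrr)
library(tibble)

# Stub standing in for the real scrape_race(); returns one row per call.
scrape_race <- function(url) {
  tibble(candidate_name = "Example Candidate", race = url)
}

# Placeholder for nc_races$race_link
race_links <- c("link-1", "link-2")

# Call scrape_race() on each link and stack the resulting tibbles
nc_candidates <- map_dfr(race_links, scrape_race)

glimpse(nc_candidates)
```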
Remember that you can use everything() for “all the other columns”.
Join the nc_candidates and nc_races data frames by the race variable using an inner_join(). Then, reorganize the columns of the resulting data frame to be in the following order: candidate_name, race_type, race, all the other columns. Save the result as nc_candidates, and glimpse() at the resulting data frame.
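A sketch of the join and reordering on two small toy data frames. All values are made up; only the column names follow the assignment.

```r
library(dplyr)
library(tibble)

nc_races <- tibble(
  race      = c("North Carolina District 01", "North Carolina District 02"),
  race_link = c("link-1", "link-2"),      # placeholder links
  race_type = "House"
)

nc_candidates <- tibble(
  candidate_name = c("Candidate A", "Candidate B"),
  race           = c("North Carolina District 01", "North Carolina District 02"),
  raised         = c(1000, 2000)          # made-up amounts
)

nc_candidates <- nc_candidates %>%
  inner_join(nc_races, by = "race") %>%                    # keep rows matched in both
  select(candidate_name, race_type, race, everything())    # reorder columns

glimpse(nc_candidates)
```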
Answer the following questions based on the nc_candidates data frame you created in Exercise 11:
a. What is the median amount of funds raised by all candidates in NC?
b. What are the median amounts of funds raised by Republican, Democratic, and third party candidates in NC?
c. Calculate the percentage of cash_on_hand compared to raised for all candidates in NC. Save the result as a new variable in the data frame called perc_cash_on_hand.
d. Among all NC candidates, which candidate has the highest percentage of cash on hand?
e. On average, have incumbents or challengers raised more money? Does this vary by party affiliation?
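The kinds of summaries these questions call for can be sketched on a toy data frame. The four candidates and their amounts are invented, so the numbers say nothing about the real NC data.

```r
library(dplyr)
library(tibble)

# Toy data; all values are made up.
toy <- tibble(
  candidate_name = c("A", "B", "C", "D"),
  party          = c("Democrat", "Republican", "Democrat", "Third party"),
  status         = c("Incumbent", "Challenger", "Challenger", "Challenger"),
  raised         = c(1000, 400, 200, 50),
  cash_on_hand   = c(500, 100, 150, 10)
)

# a. overall median of funds raised
toy %>% summarise(median_raised = median(raised))

# b. median of funds raised within each party
toy %>% group_by(party) %>% summarise(median_raised = median(raised))

# c. cash on hand as a percentage of funds raised
toy <- toy %>% mutate(perc_cash_on_hand = cash_on_hand / raised * 100)

# d. candidate with the highest percentage
top <- toy %>% slice_max(perc_cash_on_hand, n = 1)

# e. average raised by status, split by party
toy %>%
  group_by(status, party) %>%
  summarise(mean_raised = mean(raised), .groups = "drop")
```

With real data, watch for candidates whose raised amount is zero or missing, since the percentage in part c is undefined for them.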
You can copy and rename a file in RStudio using the menu items in the File pane with the same names. Check the box next to the file you want to copy, then go to More -> Copy. Then, check the box next to the newly created file to rename it using the Rename menu.
Make a copy of the scrape-nc.R file, and rename it to match a state of your own choosing, e.g. name it scrape-ca.R if you chose California.
Repeat what you did in the previous section (potentially with some modifications as needed) to
- Scrape the data on all races in the state of your choice, create a data frame with variables race, race_link, and race_type, and save the data frame as a csv file in your data/ folder.
- Load this data file in your R Markdown document and, using the scrape_race function you developed earlier, scrape information on the candidates running in each of the races in your state. Save the result as a data frame.
- Join the two data frames you have for this state by the race variable, and glimpse() at this data frame.
- Use caching as outlined in the previous section to avoid rerunning of code that scrapes data.
Answer the following questions based on the data frame you loaded in Exercise 13:
a. What is the median amount of funds raised by all candidates in your state?
b. What are the median amounts of funds raised by Republican, Democratic, and third party candidates in your state?
c. Calculate the percentage of cash_on_hand compared to raised for all candidates in your state. Save the result as a new variable in the data frame called perc_cash_on_hand.
d. Among all candidates in your state, which candidate has the highest percentage of cash on hand?
e. On average, have incumbents or challengers raised more money? Does this vary by party affiliation? How do your findings compare to your earlier findings about NC?