Appearances of candidates’ names on Twitter can help predict election results

A group of researchers at Indiana University has announced a research effort demonstrating that appearances of candidates’ names on Twitter can help predict election results.

The Washington Post picked this up in an editorial:

New research in computer science, sociology and political science shows that data extracted from social media platforms yield accurate measurements of public opinion. It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.

How good? […] co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good pre- dictor of U.S. House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Demo- cratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s tweet share and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.

Data science

This is a true data science research project, in that:

Put on your statistician hat

Spend a few minutes reading the Rojas editorial and skimming the actual paper. Be sure to consider Figure 1 and Table 1 carefully, and address the following questions.

  1. Write a sentence summarizing the findings of the paper.

  2. Discuss Figure 1 in your team. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?

  3. Discuss with in your team the differences between the Bivariate model and the Full Model. Which one do you think does a better job of predicting the outcome of an election? Which one do you think best addresses the influence of tweets on an election?

  4. Why do you suppose that the coefficient of RepublicanTweetShare is so much larger in the Bivariate model? How does this reflect on the influence of tweets in an election?

  5. Do you think the study holds water? Why or why not? What are the shortcomings of this study?

Put on your data scientist hat

Now it’s time to put on your data scientist hat. Imagine that your boss, who does not have advanced technical skills or knowledge, asked you to reproduce the study you just read. Discuss the following in your team.

  1. What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.

  2. What computational tools would you use for each task?

Acknowledgements

Note: Thanks to Ben Baumer for this example!