General info

  • Class: Tuesdays and Thursdays, 4:40 pm - 5:55 pm, Perkins 2-065

  • Office hours:
    • Mine: Old Chem 213, Mondays and Wednesdays, 4-5pm, or by appointment
    • SECC: Old Chem 211A, Sunday - Thursday 4-9pm
  • Course webpage:


  • Grading breakdown:
    • Participation & peer evaluations - 10%
    • Homework - 30%
    • Midterm project - 15%
    • Final project - 20%
    • Final exam - 25%
  • Class attendance is a firm expectation; frequent absences or tardiness will be considered a legitimate cause for grade reduction

  • Letter grades will be curved; exact cutoffs will be determined after the final exam

  • The more evidence there is that the class has mastered the material, the more generous the curve will be
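As a rough illustration of how the grading breakdown combines, here is a minimal Python sketch that computes a raw weighted score from the percentages listed above. The function name `weighted_score` and the example scores are hypothetical, and the result is the score before any curve; letter-grade cutoffs are not modeled because they are set after the final exam.

```python
# Weights from the grading breakdown, as fractions of the final grade.
WEIGHTS = {
    "participation": 0.10,    # participation & peer evaluations
    "homework": 0.30,
    "midterm_project": 0.15,
    "final_project": 0.20,
    "final_exam": 0.25,
}


def weighted_score(scores):
    """Combine per-component scores (0-100) into a raw overall score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "participation": 95,
    "homework": 88,
    "midterm_project": 80,
    "final_project": 90,
    "final_exam": 84,
}

print(round(weighted_score(example), 2))
```

Because the final exam and homework together carry 55% of the weight, they move the raw score the most.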

Class meetings

  • Interactive

  • Learn-by-doing

  • Bring your laptop to class every day


  • Short survey to gauge your previous exposure to topics relevant to the course

  • Teams of 3-5 students for in-class activities and projects


  • Larger computational tasks twice throughout the semester

  • Present findings to class

  • Collaborative / fully reproducible work


  • Single take-home final exam that you are expected to complete individually

  • Complete a number of small computational tasks that cover the breadth of the material presented in the class

Email & Piazza

  • I will regularly send announcements by email; please make sure to check your email daily

  • Any non-personal content-related questions should be posted on Piazza

  • Before posting a new question, please check whether it has already been answered, and help answer others’ questions

Academic integrity

Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;

  • I will conduct myself honorably in all my endeavors; and

  • I will act if the Standard is compromised.

Reusing / sharing code

  • A huge volume of code is available on the web to solve any number of problems

  • Unless I explicitly tell you not to use these resources, the course’s policy is that you may make use of them (e.g. StackOverflow), but you must explicitly cite where the code was obtained in your comments

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism

  • The one exception to this rule is that you may not directly share code with another team in this class; you are welcome to discuss the problems together and ask for advice, but you may not send code to or make use of code from another team

Excused absences

  • Students who miss graded work due to a scheduled varsity trip, religious holiday or short-term illness should fill out an online NOVAP, RHoliday or short-term illness form respectively

  • If you cannot complete an assignment by the due date because of a short-term illness, you have until noon the following day to complete it at no penalty; after that, the regular late work policy applies

  • Those with a personal emergency or bereavement should seek a Dean’s Excuse; check with your academic dean for details

Late work policy

  • late, but same day: -10%

  • late, next day: -20%

  • 2 days or later: no credit
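The three tiers above can be sketched as a small Python function. This is only one reading of the policy: it assumes "same day" and "next day" refer to calendar days relative to the deadline, and that the penalty is applied as a multiplier on the earned score. The function name `late_multiplier` and the dates are hypothetical, and the short-term illness extension is not modeled.

```python
from datetime import datetime


def late_multiplier(due, submitted):
    """Return the fraction of credit kept under the late work policy.

    Assumes calendar-day tiers and a multiplicative penalty.
    """
    if submitted <= due:
        return 1.0                      # on time: full credit
    days_late = (submitted.date() - due.date()).days
    if days_late == 0:
        return 0.9                      # late, but same day: -10%
    if days_late == 1:
        return 0.8                      # late, next day: -20%
    return 0.0                          # 2 days or later: no credit


# Hypothetical deadline, for illustration only.
due = datetime(2014, 9, 2, 16, 40)
print(late_multiplier(due, datetime(2014, 9, 2, 23, 0)))
print(late_multiplier(due, datetime(2014, 9, 3, 9, 0)))
print(late_multiplier(due, datetime(2014, 9, 4, 9, 0)))
```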


  • Please refrain from texting or using your computer for anything other than coursework during class

  • You must be in class on a day when you’re scheduled to present; there are no make-ups for presentations

  • Regrade requests must be made within 3 days of when the assignment is returned, and must be submitted in writing

  • Only work that is clearly assigned as team work can be completed collaboratively

  • Use of disallowed materials during the take-home exam will not be tolerated

Can Twitter predict election results?

“appearances of candidates’ names on Twitter can help predict election results”

A group of researchers at Indiana University has announced a research effort demonstrating that appearances of candidates’ names on Twitter can help predict election results.

The Washington Post picked this up in an editorial:

New research in computer science, sociology and political science shows that data extracted from social media platforms yield accurate measurements of public opinion. It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.

How good? […] co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good predictor of U.S. House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Democratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s tweet share and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.
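To make the quantities in the editorial concrete, here is a minimal Python sketch of a per-district "tweet share" and its correlation with vote share. The district counts are made up for illustration and are not the study's data. Defining tweet share as the Republican candidate's tweets divided by the tweets mentioning either major-party candidate is an assumption about what the authors computed, and the helper names are hypothetical.

```python
from math import sqrt

# Hypothetical districts: (Republican tweets, Democratic tweets,
# Republican two-party vote share in percent). Not real data.
districts = [
    (120, 80, 55.0),
    (40, 160, 38.0),
    (300, 100, 62.0),
    (90, 110, 49.0),
    (75, 25, 58.0),
]


def tweet_share(rep_tweets, dem_tweets):
    """Percentage of two-party candidate tweets naming the Republican."""
    return 100.0 * rep_tweets / (rep_tweets + dem_tweets)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


shares = [tweet_share(rep, dem) for rep, dem, _ in districts]
votes = [vote for _, _, vote in districts]
r = pearson(shares, votes)
print(f"tweet shares: {shares}")
print(f"correlation with vote share: {r:.3f}")
```

A correlation like this is the bivariate relationship only; it does not account for the district-level economic, racial and gender profile that the editorial mentions.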

Data science

This is a true data science research project, in that:

  • The data being analyzed was scraped from the Internet, not collected from a controlled, randomized trial. Typical statistical assumptions about random sampling, etc. do not apply!

  • The research question is addressed by combining domain knowledge (i.e. knowledge of how Congressional races work) with a data source (Twitter) that has no obvious relevance.

  • A large amount of data (500 million tweets!) was collected. [Note: only 500,000 tweets were analyzed.] In this case, the data was big enough that the Center for Complex Networks and Systems Research had to get involved!

  • The project was undertaken by a team of researchers from different fields (e.g. sociology, computing) working in different departments, and bringing different skills to the table.

Put on your statistician hat

Spend a few minutes reading the Rojas editorial and skimming the actual paper. Be sure to consider Figure 1 and Table 1 carefully, and address the following questions.

  1. Write a sentence summarizing the findings of the paper.

  2. Discuss Figure 1 in your team. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?

  3. Discuss within your team the differences between the Bivariate model and the Full Model. Which one do you think does a better job of predicting the outcome of an election? Which one do you think best addresses the influence of tweets on an election?

  4. Why do you suppose that the coefficient of RepublicanTweetShare is so much larger in the Bivariate model? How does this reflect on the influence of tweets in an election?

  5. Do you think the study holds water? Why or why not? What are the shortcomings of this study?

Put on your data scientist hat

Now it’s time to put on your data scientist hat. Imagine that your boss, who does not have advanced technical skills or knowledge, asked you to reproduce the study you just read. Discuss the following in your team.

  1. What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.

  2. What computational tools would you use for each task?


Note: Thanks to Ben Baumer for this example!

Computation: getting started

Install R & RStudio


code school / try R:

Homework 1: Finish the R course on Code School and submit your record of completion on Sakai

Due date: September 2 (next Tuesday)