August 26, 2014
Class: Tuesdays and Thursdays, 4:40 pm - 5:55 pm, Perkins 2-065
Course webpage: https://stat.duke.edu/courses/Fall14/sta112.01/
Class attendance is a firm expectation; frequent absences or tardiness will be considered a legitimate cause for grade reduction
Exact ranges for letter grades will be curved and cutoffs will be determined after the final exam
The more evidence there is that the class has mastered the material, the more generous the curve will be
Bring your laptop to class every day
Short survey to gauge your previous exposure to topics relevant to the course
Teams of 3-5 students for in-class activities and projects
Larger computational tasks twice throughout the semester
Present findings to class
Collaborative / fully reproducible work
Single take-home final exam that you are expected to complete individually
Complete a number of small computational tasks that cover the breadth of the material presented in the class
I will regularly send announcements by email; please make sure to check your email daily
Any non-personal content-related questions should be posted on Piazza
Before posting a new question, please check whether it has already been answered, and answer others' questions when you can
Duke Community Standard:
I will not lie, cheat, or steal in my academic endeavors;
I will conduct myself honorably in all my endeavors; and
I will act if the Standard is compromised.
A huge volume of code is available on the web to solve any number of problems
Unless I explicitly tell you not to use these resources, the course’s policy is that you may use them (e.g. StackOverflow), but you must explicitly cite in your comments where the code was obtained
Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism
The one exception to this rule is that you may not directly share code with another team in this class. You are welcome to discuss the problems together and ask for advice, but you may not send code to, or make use of code from, another team
Students who miss graded work due to a scheduled varsity trip, religious holiday or short-term illness should fill out an online NOVAP, RHoliday or short-term illness form respectively
If you cannot complete an assignment on the due date due to a short-term illness, you have until noon the following day to complete it at no penalty; after that, the regular late work policy applies
Those with a personal emergency or bereavement should seek a Dean’s Excuse; check with your academic dean for details
late, but same day: -10%
late, next day: -20%
2 days or later: no credit
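The late work schedule above can be sketched as a small function. This is only an illustration under stated assumptions: the name `late_multiplier` is hypothetical, "same day" is read as the same calendar date as the deadline, and each penalty is read as a fraction of the assignment's credit. The short-term illness extension is not modeled.

```python
from datetime import datetime


def late_multiplier(due: datetime, submitted: datetime) -> float:
    """Fraction of credit kept under the late work schedule (hypothetical helper)."""
    if submitted <= due:
        return 1.0  # on time: full credit
    # Count calendar days between the due date and the submission date.
    days_after = (submitted.date() - due.date()).days
    if days_after == 0:
        return 0.9  # late, but same day: -10%
    if days_after == 1:
        return 0.8  # late, next day: -20%
    return 0.0      # 2 days or later: no credit


# Example with a made-up deadline.
due = datetime(2014, 9, 2, 16, 40)
print(late_multiplier(due, datetime(2014, 9, 2, 23, 0)))
```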
Please refrain from texting or using your computer for anything other than coursework during class
You must be in class on a day when you're scheduled to present; there are no make-ups for presentations
Regrade requests must be made within 3 days of when the assignment is returned, and must be submitted in writing
Only work that is clearly assigned as team work can be completed collaboratively
Use of disallowed materials during the take-home exam will not be tolerated
A group of researchers at Indiana University has announced a research effort demonstrating that appearances of candidates’ names on Twitter can help predict election results.
The Washington Post picked this up in an editorial:
New research in computer science, sociology and political science shows that data extracted from social media platforms yield accurate measurements of public opinion. It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.
How good? […] co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good predictor of U.S. House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Democratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s tweet share and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.
This is a true data science research project, in that:
The data being analyzed was scraped from the Internet, not collected from a controlled, randomized trial. Typical statistical assumptions about random sampling, etc. do not apply!
The research question is addressed by combining domain knowledge (i.e. knowledge of how Congressional races work) with a data source (Twitter) that has no obvious relevance.
A large amount of data (500 million tweets!) was collected. [Note: only 500,000 tweets were analyzed.] In this case, the data was big enough that the Center for Complex Networks and Systems Research had to get involved!
The project was undertaken by a team of researchers from different fields (e.g. sociology, computing) working in different departments, and bringing different skills to the table.
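The per-district tweet share described in the editorial can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each tweet has already been labeled with a district and the party of the candidate it mentions, the function name `republican_tweet_share` is hypothetical, and the example tweets are made up.

```python
from collections import Counter


def republican_tweet_share(tweets):
    """Percentage of candidate-mentioning tweets, per district, that name the Republican.

    `tweets` is an iterable of (district, party) pairs with party "R" or "D".
    """
    counts = Counter(tweets)  # number of tweets per (district, party)
    districts = {district for district, _ in counts}
    shares = {}
    for district in districts:
        rep = counts[(district, "R")]  # Counter returns 0 for missing keys
        dem = counts[(district, "D")]
        shares[district] = 100.0 * rep / (rep + dem)
    return shares


# Made-up example data, for illustration only.
toy_tweets = [("IN-01", "R"), ("IN-01", "D"), ("IN-01", "R"), ("IN-02", "D")]
print(republican_tweet_share(toy_tweets))
```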
Spend a few minutes reading the Rojas editorial and skimming the actual paper. Be sure to consider Figure 1 and Table 1 carefully, and address the following questions.
Write a sentence summarizing the findings of the paper.
Discuss Figure 1 in your team. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?
Discuss within your team the differences between the Bivariate model and the Full Model. Which one do you think does a better job of predicting the outcome of an election? Which one do you think best addresses the influence of tweets on an election?
Why do you suppose that the coefficient of RepublicanTweetShare is so much larger in the Bivariate model? How does this reflect on the influence of tweets in an election?
Do you think the study holds water? Why or why not? What are the shortcomings of this study?
Now it’s time to put on your data scientist hat. Imagine that your boss, who does not have advanced technical skills or knowledge, asked you to reproduce the study you just read. Discuss the following in your team.
What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.
What computational tools would you use for each task?
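As a starting point for the discussion, one possible subtask is fitting a bivariate model of vote share on tweet share. The sketch below uses the closed-form least-squares formulas in plain Python. The function name `fit_bivariate` is hypothetical and the numbers are made up for illustration; they are not taken from the paper.

```python
def fit_bivariate(x, y):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up district-level percentages, for illustration only.
tweet_share = [20.0, 40.0, 60.0, 80.0]
vote_share = [40.0, 50.0, 60.0, 70.0]
intercept, slope = fit_bivariate(tweet_share, vote_share)
print(intercept, slope)
```

A full model would add district-level covariates as further predictors, which calls for a regression routine from a statistics package rather than this single-predictor formula.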
Note: Thanks to Ben Baumer for this example!
code school / try R: https://www.codeschool.com/courses/try-r
Homework 1: Finish the R course on code school, submit your record of completion on Sakai
Due date: September 2 (next Tuesday)