The 2014 FIFA World Cup took place this summer in Brazil. Over the course of the tournament, 171 goals were scored (not including penalty shootouts). The goal of this assignment is to work with semistructured data obtained from the web to construct a usable dataset within R that you will then visualize.

The data we will use for this project are available here, along with the codebook. This is an extract taken from the website of Emil Johansson, who has constructed a number of World Cup 2014 visualizations.

There are any number of ways to approach this task and there is no one right solution, so be as creative as possible. If you want to supplement these data with outside sources, you are more than welcome to; just be sure to document where the additional data come from and why they are included.

Loading the data

To directly load the data, use the following (or a slightly modified version depending on your operating system):

download.file("https://stat.duke.edu/courses/Fall14/sta112.01/data/wc14.csv", destfile = "wc14.csv", method = "curl")
wc14 = read.csv("wc14.csv")
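If `method = "curl"` fails on your system (it relies on a `curl` program being installed), one possible alternative is sketched below. It lets R pick the download method and skips the download when a local copy already exists. The helper name `load_wc14` is hypothetical and not part of the assignment; the URL is the one given above.

```r
# Sketch only: `load_wc14` is a hypothetical helper, not something the course provides.
load_wc14 <- function(path = "wc14.csv",
                      url = "https://stat.duke.edu/courses/Fall14/sta112.01/data/wc14.csv") {
  # Download only if there is no local copy yet, so re-knitting does not re-download.
  if (!file.exists(path)) {
    # No `method` argument: R chooses a download method available on this system.
    download.file(url, destfile = path)
  }
  # Keep text columns as character vectors rather than factors.
  read.csv(path, stringsAsFactors = FALSE)
}
```

A call such as `wc14 = load_wc14()` followed by `str(wc14)` shows the column names and types, which you can then check against the codebook.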

Task 1: Visualization

The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use; the only requirement is that code and results must be entirely self-contained in your R Markdown file. For example, there are many different ways to plot in R (e.g. base graphics, ggplot, etc.) and there are many additional packages that focus on specific types of data and visualization tasks. See here, here and here for brief introductions to some of R’s visualization capabilities. You can even create interactive graphics or web apps using R.
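As a starting point only, here is a minimal base-graphics sketch of one simple plot type, a bar chart of counts per group. The data frame below is toy data invented for illustration, and the column names `team` and `minute` are assumptions; check the codebook for the actual variable names in `wc14`.

```r
# Toy data, invented for illustration only; these are NOT values from wc14.csv.
# The column names `team` and `minute` are assumptions, not the real variable names.
goals <- data.frame(
  team   = c("A", "A", "B", "C", "C", "C"),
  minute = c(12, 67, 45, 3, 58, 90)
)

# Each row is one goal, so counting rows per team gives goals per team.
# Sort so the team with the most goals is drawn first.
goals_per_team <- sort(table(goals$team), decreasing = TRUE)

# Base-graphics bar chart of the counts.
barplot(goals_per_team,
        main = "Goals per team (toy data)",
        xlab = "Team",
        ylab = "Goals")
```

A plot like this is deliberately plain; a compelling visualization will need more thought about what comparison you want the viewer to see.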

You do not need to use all of the data; you are welcome to focus on a single player or single team, or any combination thereof. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations.

The visualization you produce should include a brief write-up giving the context of your visualization and why you believe it is compelling.

Task 2: Inference and/or modeling

While we have “population” data from all goals scored in World Cup 2014, these data still represent a random instance of the teams’ performance over time. Pose (at least two) questions that can be answered using inference and modeling tools we have learned about in class, and answer them. Make sure that your research questions are clearly stated, the methods you’re using are briefly discussed, and your conclusions are stated in the context of the data.
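To illustrate the shape of such an analysis (question, method, result), here is a sketch for one possible question: are goals equally likely to fall in either half of regulation time? It uses an exact binomial test via `binom.test`. The helper `test_halves`, the `minute` variable, and the minutes themselves are assumptions made up for illustration; whether this test is appropriate for your question, and how the real data record stoppage and extra time, is for you to check.

```r
# Hypothetical helper: tests H0 that a regulation-time goal is equally likely
# to fall in the first half (minutes 1-45) or the second half (minutes 46-90).
# Assumes `minute` is numeric; minutes outside 1-90 are dropped, since how
# wc14.csv codes stoppage and extra time is not assumed here.
test_halves <- function(minute) {
  regulation <- minute[!is.na(minute) & minute >= 1 & minute <= 90]
  first_half <- sum(regulation <= 45)
  # Exact binomial test of H0: P(goal falls in first half) = 0.5
  binom.test(first_half, length(regulation), p = 0.5)
}

# Toy minutes, invented for illustration only (not values from wc14.csv).
toy_minutes <- c(5, 18, 23, 31, 44, 52, 60, 61, 70, 77, 83, 88, 90)
result <- test_halves(toy_minutes)
result$p.value
```

When writing up a result like this, state the hypotheses, note the simplifications (for example, treating the two halves as equally long), and interpret the p-value in terms of goals and matches rather than in abstract terms.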

Submission and grading

This homework is due by class on Tuesday, October 21. On Sakai, submit:

  1. a fully reproducible R Markdown file
  2. the resulting HTML file
  3. your presentation

You are to complete the assignment as a group. All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion; anyone judged not to have contributed sufficiently to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

The final products for this assignment are a single R Markdown file (formatted to clearly present all of your code and results) as well as a presentation of 10 minutes maximum. Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

For this class I will not enforce any particular coding style; however, it is important that the code you and your team produce is readable and consistent in its formatting. There are several R style guides online, e.g. from Google and from Hadley Wickham. As a group you should decide on what conventions you will use, and the entire team should conform to them as much as possible.

You can use any software you like for your final presentation. (Note that you can create slides using R Markdown as well.) There isn’t a limit to how many slides you can use, just a time limit (10 minutes total). Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”); instead, it should convey what choices you made, why you made them, and what you found.