…and do something with it. That is your final project in a nutshell. More details below.


The final project for this class will consist of a statistical analysis of a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. If possible, choose a dataset that interests you and that may be useful for another class project, or for your research if you are working in a lab. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class and apply them to a novel dataset in a meaningful way.


In order for you to have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and between 10 and 20 variables (exceptions can be made, but you must speak with me first). The dataset’s variables should include categorical variables, discrete numerical variables, and continuous numerical variables.
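As a quick sanity check against these size guidelines, something like the following sketch may help. The helper name `check_dataset` and its arguments are hypothetical (not part of any package), and treating whole-number numeric columns as "discrete" is only a rough assumption; you should still inspect each variable yourself.

```r
# Hypothetical helper (an illustration, not a course-provided function):
# summarises whether a data frame meets the size guidelines above.
check_dataset <- function(df, min_obs = 50, min_vars = 10, max_vars = 20) {
  # Categorical: factor, character, or logical columns
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x) || is.logical(x),
                   logical(1))
  is_num <- vapply(df, is.numeric, logical(1))
  # Assumption: a numeric column counts as discrete when every
  # non-missing value is a whole number
  is_whole <- vapply(df, function(x) is.numeric(x) && all(x[!is.na(x)] %% 1 == 0),
                     logical(1))
  list(
    n_obs         = nrow(df),
    n_vars        = ncol(df),
    enough_obs    = nrow(df) >= min_obs,
    vars_in_range = ncol(df) >= min_vars && ncol(df) <= max_vars,
    n_categorical = sum(is_cat),
    n_discrete    = sum(is_whole),
    n_continuous  = sum(is_num & !is_whole)
  )
}

# Example with a built-in dataset (mtcars has 32 rows and 11 columns,
# so it would be too small for this project)
result <- check_dataset(mtcars)
str(result)
```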

All analyses must be done in RStudio. If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into RStudio, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.
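A minimal sketch of checking that a file loads correctly is shown below. It assumes a CSV file; the temporary file is created here only so the example is self-contained. Other formats (for example Excel or SPSS files) usually need an add-on package such as readxl or haven, which this sketch does not cover.

```r
# Write a small example file so the sketch is self-contained;
# in practice `path` would point to your own data file.
path <- tempfile(fileext = ".csv")
write.csv(head(iris, 10), path, row.names = FALSE)

# Load the file, converting text columns to factors
dat <- read.csv(path, stringsAsFactors = TRUE)

# Confirm the import worked as expected
dim(dat)             # number of rows and columns
str(dat)             # type of each column
colSums(is.na(dat))  # missing values per column
```

If the dimensions or column types are not what you expect, that is a good moment to ask for help.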

Reusing datasets from class: You’re welcome to use one of the Data Expeditions datasets from class: MLB/NBA data or Paris Paintings. If you choose to use one of these, you should get in touch with the data experts you met in class to talk through your project. They have offered to be of assistance along the way.

However, note that you’re not limited to these datasets.

Project Proposal - due Nov 26

By Nov 26 you must turn in a draft of the introduction section of your project. This introduction should introduce your general research question (this should include your hypothesized answer) and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).

The proposal should also include a short data analysis plan as well as your dataset.

The data analysis plan should include:

1. The comparison groups you will use, if applicable.
2. The outcome (dependent, response, Y) and predictor (independent, explanatory, X) variables you will use to answer your question.
3. The statistical method(s) that you believe will be useful in answering your question(s). (You can update these later as you work on your project.)
4. What results from these specific statistical methods are needed to support your hypothesized answer?

The introduction should be no more than 1 page (excluding figures) and the data analysis plan should also be no more than 1 page. Your write up and all typesetting must be done using R Markdown.

Project - due December 4

After providing the description of your dataset and research question in the introduction, you must apply what you have learned about descriptive statistics, visualizations, correlation and regression, and inference (hypothesis testing and confidence intervals) to your dataset. The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show me that you are proficient at using R and at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. Also pay attention to your presentation. Neatness, coherency, and clarity will count.
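To illustrate the kind of focused analysis meant here, the sketch below runs one descriptive summary, one correlation, one regression, and one hypothesis test. It uses the built-in `mtcars` data purely as a stand-in; the choice of variables and methods is an assumption for illustration, not a template you must follow.

```r
# Descriptive statistics for the response variable
summary(mtcars$mpg)

# Correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
r

# Simple linear regression: mpg as response, wt as predictor
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                    # intercept and slope estimates
confint(fit, level = 0.95)   # 95% confidence intervals for the coefficients

# Two-sample t-test comparing mpg for automatic (am = 0)
# and manual (am = 1) transmissions
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value    # p-value for the difference in means
tt$conf.int   # confidence interval for the difference in means
```

In a write up, each of these outputs would be accompanied by an interpretation tied back to the research question.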

Presentation - due December 4

10 minutes maximum, and each team member should say something substantial.

You can use any software you like for your final presentation. (Remember that you can create slides using R Markdown as well.) There isn’t a limit to how many slides you can use, just a time limit (10 minutes total). Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”). Instead, it should convey what choices you made, why you made them, and what you found.


Your submission should include

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

Further details


The project is very open ended. You should create some kind of compelling visualization(s) of your data in R. There is no limit on what tools or packages you may use. Remember that there are many different ways to plot in R (e.g. base graphics, ggplot, etc.) and there are many additional packages that focus on specific types of data and visualization tasks. See here, here and here for brief introductions to some of R’s visualization capabilities. You can even create interactive graphics or web apps using R. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations.
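As one possible starting point, the sketch below builds a single labelled scatterplot with base graphics and saves it to a PDF file. The built-in `mtcars` data, the colours, and the output path are all assumptions chosen for illustration; ggplot or another package would work equally well.

```r
# Save the figure to a temporary PDF so the sketch runs anywhere;
# inside an R Markdown chunk you would simply call plot() directly.
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)

# Colour points by transmission type (am: 0 = automatic, 1 = manual)
palette_am <- c("steelblue", "darkorange")
cols <- palette_am[mtcars$am + 1]

plot(mtcars$wt, mtcars$mpg, col = cols, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel efficiency versus weight")

# Add the fitted regression line and a legend
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
legend("topright", legend = c("Automatic", "Manual"),
       col = palette_am, pch = 19)

dev.off()
```

Informative axis labels, a title, and a legend are small touches that separate a polished figure from a default one.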

Your write up must also include a one to two page conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Also critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project should also be included.


In your write up, your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code, so that if I re-knit your Rmd file I am able to obtain the results you presented.

Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
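One way to hide code by default is to set the chunk option globally in a setup chunk near the top of the Rmd file, as in the sketch below. The `requireNamespace()` guard is included only so the snippet also runs outside of R Markdown; when knitting, the knitr package is already loaded.

```r
# Contents of a setup chunk: hide code in the knitted document by default.
# The guard is only an assumption for running this outside of knitting.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = FALSE)
}
```

To show one specific piece of code, set `echo = TRUE` in the header of that chunk alone, which overrides the global default for that chunk.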

For this class I will not enforce any particular coding style; however, it is important that the code you and your team produce is readable and consistent in its formatting. There are several R style guides online, e.g. from Google and from Hadley Wickham. As a group you should decide on what conventions you will use, and the entire team should conform to them as much as possible.


You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.


Your grade will be based

Grading of the project will take into account the following:

A general breakdown of scoring is as follows:

If you score less than 30% on the project you cannot pass this course. Late projects incur a penalty of 10% per day.