Introduction

TLDR: Pick (or create) a dataset and do something with it. That is your final project.

The final project for this class will consist of analysis on a dataset of your own choosing or creation. The dataset may already exist, or you may collect your own data using a survey, by conducting an experiment, or by scraping the web. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a dataset in a meaningful way.

Data sources

In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and at least 10 variables (exceptions can be made but you must speak with me first). The dataset’s variables should include categorical variables, discrete numerical variables, and continuous numerical variables.

All analyses must be done in RStudio. If you are using a dataset that comes in a format that we haven’t encountered in class (for instance, a .DAT file), make sure that you are able to load it into RStudio as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

Reusing datasets from class: Do not reuse datasets used in examples / homework in the class. However, you are welcome to extend datasets. For instance, if you were interested in a more detailed analysis of the Pokemon dataset, you are welcome to use web scraping to add variables such as Pokemon type or evolution status to the existing dataset.

Project proposal

Due Monday, April 6 at 11:59pm US Eastern Time on Gradescope.

The proposal is a draft of the introduction section of your project as well as a data analysis plan and your dataset, and is comprised of sections 1 through 3 as detailed below.

Your write up and all typesetting must be done using R Markdown. All team members must contribute to the GitHub repository. Before you finalize your write up, make sure the printing of code chunks is turned off with the chunk option echo = FALSE.

When applicable you should use the functions and packages introduced in the course. However, we realize not all datasets and research questions are identical. Thus, use tidyverse syntax when possible, but if you use a package that we have not discussed in the course, we will be more lenient regarding tidyverse syntax (that being said, no dollar signs!). Remember to use good coding style and best visualization practices as discussed by Eric Monson.

You must turn in your proposal as a .pdf to Gradescope in order to receive credit.

Section 1 - Introduction

The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.). This section should be no more than one page, excluding figures.

Section 2 - Data analysis plan

The data analysis plan should include:

  • The outcome (dependent, response, \(Y\)) and predictor (independent, explanatory, \(\mathbf{X}\)) variables you will use to answer your question.
  • Very preliminary exploratory data analysis, including some summary statistics and visualizations, along with some explanation on how they help you learn more about your data. (You can add to these later as you work on your project.)
  • The statistical method(s) that you believe will be useful in answering your question(s). (You can update these later as you work on your project.)
  • What results from these specific statistical methods are needed to support your hypothesized answer?

This section should be no more than two pages, excluding figures.

Section 3 - Data

Place your data in the data folder in your team’s repository, and then add the dimensions of your dataset and a codebook to the README in this folder. Finally, print out the output of glimpse of your data frame.

Total 20 pts
Introduction 6 pts
Data analysis plan 9 pts
Data and codebook 3 pts
Organization and code quality 2 pts

Make sure you only print the glimpse output of your dataset.

Main write-up

Due Wednesday, April 29 at 2:00pm US Eastern Time on Gradescope.

There is no page limit or requirement, but do be judicious in your write-up. The full write-up should build on your project proposal above. That is, you should build on the Markdown file from your project proposal (the entire write-up should be one document by the end of the course).

When applicable you should use the functions and packages introduced in the course. However, we realize not all datasets and research questions are identical. Thus, use tidyverse syntax when possible, but if you use a package that we have not discussed in the course, we will be more lenient regarding tidyverse syntax (that being said, no dollar signs!). Remember to use good coding style and best visualization practices as discussed by Eric Monson.

You must turn in your write-up as a .pdf to Gradescope in order to receive credit.

Sections 1 through 3

“Finalize” sections 1 through 3 with final visualizations, etc. These will not be re-graded, but subsequent sections will build on these so make sure they are set. It could be that you do not need to change anything from your project proposals – this is ok too!

Section 4 - Methods and Results

Use the remainder of your write up to showcase how you arrived at answers to your question using any techniques we have learned in this class (and some beyond, if you’re feeling adventurous). The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. Be sure to discuss any assumptions you are making and whether they are satisfied.

Section 5 - Discussion

This section is a one to two page conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project should also be included.

Total 35 pts
Methods and results 15 pts
Discussion 15 pts
Organization and code quality 5 pts

Summary slides

Due Wednesday, April 29 at 2:00pm US Eastern Time on Gradescope.

In addition to the write-up, your team must also create three (3) presentation slides that summarize and showcase your project (do not include a “title” slide; all slides should have content). Introduce your research question and dataset, showcase a visualization, and give us some conclusions. These slides should serve as a brief visual accompaniment to your write-up and will be graded for content and quality. For submission, convert these slides to a .pdf document to be submitted to Gradescope.

You must turn in your slides as a .pdf to Gradescope in order to receive credit.

Total 7 pts
Summary slides 7 pts

Overall notes

The project is very open ended. For instance, in creating a compelling visualization(s) of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations.

Before you finalize your write up, make sure the printing of code chunks is turned off with the option echo = FALSE.

Finally, pay attention to details in your write-up and presentation. Neatness, coherency, and clarity will count.

Total 13 pts
Repo, document organization, and team contributions 3 pts
Overall neatness and presentation style 10 pts

GitHub repository

You must turn in relevant components to Gradescope in order to receive credit.

In addition to your Gradescope submissions, we will be checking your GitHub repository. This repository should be contributed to equally by all team members and should include

  • RMarkdown file (formatted to clearly present all of your code and results) that will output the proposal and write-up in one document
  • Meaningful README file on the GitHub repository
  • Dataset(s) (in csv or RData format, in a /data folder)
  • Presentation (if using Keynote/PowerPoint/Google Slides, export to PDF and put in repo, in a /presentation folder)

Style and format does count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

Tips

  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.

  • Review the grading breakdown and ask questions if any of the expectations are unclear.

  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).

  • Code: In your write up your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However your document should include all your code such that if I re-knit your Rmd file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.

  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment and group assessments will be given at its completion - anyone judged to not have sufficiently contributed to the final product will have their grade penalized. While different teams members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment works.

Grading

The grade breakdown is as follows:

Total 75 pts
Introduction 6 pts
Data analysis plan 9 pts
Data and codebook 3 pts
Methods and results 15 pts
Discussion 15 pts
Summary slides 7 pts
Repo, document organization, and team contributions 3 pts
Overall neatness and presentation style 10 pts
Organization and code quality (project proposal) 2 pts
Organization and code quality (main write-up) 5 pts

Team peer evaluation: You will be asked to fill out a survey where you rate the contribution and teamwork of each team member. This survey may modify the individual grade received by each group member. It is possible to have your grade fall from an “S” to “U” if your individual contribution is not significant throughout the project’s duration.

Grading of the project will take into account the following:

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Late work policy

There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.

With the outbreak of the global coronavirus pandemic, we understand that not all students may be in the same timezone. With this understanding, however, we will be strict regarding late work with respect to the US Eastern Time Zone, and all components must be submitted to Gradescope by the deadline in order to receive credit.

If you do not turn in your final project on time, you will not pass the course.