Project proposal due Tuesday, March 23 at 11:59pm ET
Peer feedback due Wednesday, April 21 at 11:59pm ET
Report due Wednesday, April 28 at 12:00pm ET
The goal of this project is to demonstrate proficiency in data science techniques by conducting a novel analysis of a dataset of your own choosing or creation. The dataset may already exist, or you may collect your own data using a survey or by scraping the web.
The final project will be done with your lab groups.
The four deliverables for the final project are
No late projects are accepted. As per the syllabus, you and your team must complete all components of the final project to pass the course.
The grade breakdown is as follows:
Component | Points |
---|---|
Project proposal | 5 pts |
Written report | 80 pts |
Repository | 5 pts |
Peer feedback | 10 pts |
Total | 100 pts |
To perform a successful analysis, it is imperative that you choose a manageable dataset that can be analyzed using the tools we have learned in STA 199. This means that the data should be readily accessible, not contain too many missing values, and be large enough that multiple relationships can be explored. Your dataset must have at least 500 observations and at least ten variables (exceptions require my approval). The dataset should include a rich mix of categorical, discrete numeric, and continuous numeric data. If you have any doubts or are having trouble, please reach out to me.
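The size and missingness requirements above can be checked quickly before you commit to a dataset. The following is a minimal sketch in base R, assuming your data is already loaded as a data frame. The helper name check_dataset and the toy data frame are hypothetical; the thresholds of 500 rows and ten columns come from the requirements above.

```r
# Hypothetical helper: report whether a data frame meets the size
# requirements (at least 500 observations, at least ten variables)
# and the share of missing values in each variable.
check_dataset <- function(df, min_rows = 500, min_cols = 10) {
  list(
    n_rows       = nrow(df),
    n_cols       = ncol(df),
    enough_rows  = nrow(df) >= min_rows,
    enough_cols  = ncol(df) >= min_cols,
    # proportion of NA values in each column
    prop_missing = colMeans(is.na(df))
  )
}

# Toy data for illustration only (made up, not a real dataset).
toy <- data.frame(
  id    = 1:600,
  group = rep(c("a", "b", "c"), times = 200),
  score = c(rep(NA, 60), seq_len(540))
)

result <- check_dataset(toy)
print(result$enough_rows)
print(result$enough_cols)
print(result$prop_missing)
```

What counts as "too many" missing values is a judgment call that depends on the variables you plan to use, so the sketch reports the proportions rather than enforcing a cutoff.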
All analyses must be done in RStudio, and your final written report and analysis must be reproducible. This means that you must create an R Markdown document, linked to a GitHub repository, that reproduces your written report exactly when knit.
If you are using a dataset that comes in a format that we haven’t encountered in class (for instance, a .DAT file), make sure that you are able to load it into RStudio as this can be tricky depending on the source. Again, if you are having trouble, ask for help.
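Many .DAT files are plain text with whitespace-separated columns, though the extension guarantees nothing about the layout. The following is a minimal sketch in base R, assuming a whitespace-delimited file with a header row. The file contents are made up for illustration; a fixed-width or binary file would need a different reader (for example, read.fwf() for fixed-width text).

```r
# Write a small made-up .DAT file to a temporary location so the example
# is self-contained. In your project, the path would point into data/.
dat_path <- tempfile(fileext = ".DAT")
writeLines(
  c("id height group",
    "1 170.2 a",
    "2 165.0 b",
    "3 NA a"),
  dat_path
)

# Peek at the raw lines first to see the delimiter and whether a header exists.
print(readLines(dat_path, n = 2))

# Assumption: columns are separated by whitespace and the first line is a header.
dat <- read.table(dat_path, header = TRUE, stringsAsFactors = FALSE)
str(dat)
```

Inspecting the first few raw lines before choosing a reader is usually the quickest way to find out which format you are dealing with.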
Reusing datasets from class: Do not reuse datasets, or variations of datasets, used in examples or homework. Do not use any datasets from Kaggle or Spotify. Also, you may not use data you analyzed in another course.
The resources below may be helpful for finding an interesting dataset but feel free to explore on your own.
The first stage of the final project is the project proposal. The proposal is designed as a check to make sure you choose a dataset that allows you to perform an interesting analysis using the tools we have developed in STA 199. Choose three substantially different datasets you are interested in analyzing. For each, identify the two components below.
Identify the source of the data, when and how it was originally collected, the cases, and a general description of relevant variables. Use the glimpse() function to glimpse your data and include the output in your proposal.
Place the file containing your data in the data/ folder of your project repo.
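Loading a file from the data/ folder and glimpsing it can be sketched as follows, assuming the readr and dplyr packages are installed. The file name example.csv is hypothetical. To keep the example self-contained, it first writes a small made-up CSV into a temporary data/ folder, which you would replace with your real file.

```r
library(readr)
library(dplyr)

# Self-contained setup: create a temporary project-like data/ folder with a
# made-up CSV. In your repo, the file would already live in data/.
project_dir <- tempfile("project")
dir.create(file.path(project_dir, "data"), recursive = TRUE)
csv_path <- file.path(project_dir, "data", "example.csv")  # hypothetical name
write_csv(
  tibble(id = 1:3, group = c("a", "b", "a"), score = c(2.5, 3.1, NA)),
  csv_path
)

# Load the data and print a compact overview of the variables and their types.
example <- read_csv(csv_path, show_col_types = FALSE)
glimpse(example)
```

In your own R Markdown file, a relative path such as data/ followed by your file name keeps the report reproducible for anyone who clones the repository.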
Describe your research topic and provide a concise, well-written statement of your research question and hypotheses.
Submit the PDF of your proposal to Gradescope. The teaching team will provide feedback on your proposal and help you decide which dataset you should use for your final project. Project proposals should be one page maximum.
Your final report must be written using R Markdown. All team members must contribute to the GitHub repository, with regular meaningful commits / pushes. Before you finalize your report, make sure the printing of code chunks is turned off with the option echo = FALSE.
Your final report must match your GitHub repository exactly. The mandatory components of the report are as follows, but feel free to expand with additional sections as necessary. Your final written report should not exceed five pages inclusive of all tables and figures.
The written report is worth 80 points, broken down as follows:
Section | Points |
---|---|
Introduction/data | 10 pts |
Methodology | 25 pts |
Results | 25 pts |
Discussion | 20 pts |
Total | 80 pts |
The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses.
Then identify the source of the data, when and how it was collected, the cases, and a general description of relevant variables.
The methodology section should include visualizations and summary statistics relevant to your research question. You should also justify the choice of statistical method(s) used to answer your research question.
Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous). Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculating every possible statistic and performing every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.
This section is a conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. You should critique your own methods and provide suggestions for improving your analysis and future work. Issues pertaining to the reliability and validity of your data and the appropriateness of the statistical analyses should also be discussed. Also include a brief paragraph on what you would do differently if you were able to start over with the project.
In addition to your Gradescope submissions, we will be checking your GitHub repository. This repository should have equal contribution by all team members and should include
your data (in the data/ folder)
Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.
Critically reviewing others’ work is a crucial part of the scientific process, and STA 199 is no exception. You will be assigned a team to review and will be given read access to their project repo. You have until Wednesday, April 21 at 11:59pm ET to provide a detailed critique of the written report and data analysis via GitHub issues.
This review is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.
Lab on Tuesday, April 20 will be devoted to peer review.
Ask questions if any of the expectations are unclear.
Code: In your write-up, your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code, such that if I re-knit your Rmd file I obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
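One common way to hide code throughout the report, assuming it is knit with knitr (the default engine for R Markdown), is to set the option once in a setup chunk near the top of the Rmd file instead of on every chunk. A sketch of the R code that would go inside that chunk:

```r
# Set echo = FALSE as the default for every chunk in the document, so code
# is hidden in the knitted report but still runs when the file is re-knit.
knitr::opts_chunk$set(echo = FALSE)

# Confirm the default that later chunks will inherit.
print(knitr::opts_chunk$get("echo"))
```

To show one specific piece of code, override the default on that chunk alone by adding echo = TRUE to its chunk header.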
Merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.
The project is very open-ended. For instance, in creating a compelling visualization of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations.
Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).
All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.
Pay attention to details in your write-up. Neatness, coherency, and clarity will count.
Write all R code according to the style guidelines discussed in class.
Grading of the project will take into account the following:
A general breakdown of scoring is as follows:
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member with justification. This will contribute to your final project grade.
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.