Art: From Kawarazaki Shodo, Take.

Project objectives

The purpose of the open-ended project is for you to use your statistical toolkit to address an outstanding research question in a principled, reproducible way. Using a dataset of your choosing, identify an interesting hypothesis, select appropriate statistical methods, carry out your analysis, and present your results in a reproducible report and oral presentation in a way that is accessible to allied researchers.

The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.

Logistics

The final project is due Wednesday, May 1st.

Note that there is no grace period for the final manuscript, because the project serves as this course’s final exam and is on the university final exam schedule. Please use your time wisely!

Note: your GitHub report repository and commit history will also be evaluated by the instructor. The GitHub repository must contain the reproducible R Markdown document corresponding to the submitted report, and it will be checked throughout the course of the project.

You may either work on the project alone, or with up to one (1) other person. If you have a three-person lab group, you may work with your entire lab group. If you would like to work with another student but do not have one identified, please let me know and I will pair you up with another student in the same situation. Your teammate does not necessarily have to be in your lab group.

The expected length, including all graphs, tables, narrative, etc., is 4 - 6 pages; your report may not exceed 8 pages.

Learning objectives

  • Design an appropriate statistical analysis from scratch
  • Write and develop a statistically sound analysis plan
  • Gain independence as a practicing statistician to tackle an open-ended research question of your choice
  • Critically think about reasonable analysis approaches in the context of real-world data
  • Express statistical models clearly and correctly
  • Develop scientific writing skills by providing clear, concise, data-driven conclusions suitable for allied researchers

Dataset

The datasets used should meet the following criteria:

  • At least 500 observations
  • At least 10 columns
  • At least 6 of the columns must be useful and unique predictor variables.
    • Identifier variables such as “name”, “social security number”, etc. are not useful predictor variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
  • At least one variable that can be identified as a reasonable response variable (the response variable can be quantitative or categorical).
  • A mix of quantitative and categorical variables that can be used as predictors.
  • Observations should reasonably meet the independence condition. Therefore, avoid data with repeated measures, data collected over time, etc.
  • You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
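
The countable criteria above can be checked before you commit to a dataset. The following is a minimal sketch in base R, not a required step: `check_dataset` is a hypothetical helper and `toy` is simulated data invented for illustration. It only counts rows, columns, and variable types. Whether predictors are useful and unique, and whether observations are independent, still requires your own judgment. Note, for example, that a numeric identifier column is counted as quantitative here.

```r
# Hypothetical helper: reports only the criteria that can be counted mechanically.
check_dataset <- function(df, min_rows = 500, min_cols = 10) {
  is_quant <- vapply(df, is.numeric, logical(1))
  is_categ <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )
  list(
    enough_rows    = nrow(df) >= min_rows,
    enough_cols    = ncol(df) >= min_cols,
    n_quantitative = sum(is_quant),
    n_categorical  = sum(is_categ)
  )
}

# Simulated toy data (an assumption for illustration; not a suitable project dataset).
set.seed(1)
toy <- data.frame(
  id    = seq_len(600),
  x1    = rnorm(600),
  group = sample(c("a", "b", "c"), 600, replace = TRUE)
)

result <- check_dataset(toy)
str(result)
```

The toy data has enough rows but only three columns, so `result$enough_cols` is `FALSE`.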

Detailed report instructions

Create a GitHub repository and invite me (Yue-J).

Your report should contain the following components:

Introduction and data

This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Grading criteria

The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description of how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.). The exploratory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.

Methodology

This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting and the predictor variables considered for the model, including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.

Grading criteria

The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to select the final model; the approach is clearly described in the report. The model selection process was reasonable, and any violations in model conditions were discussed and/or fixed. The model conditions and diagnostics are thoroughly and accurately assessed for the final model. If violations of model conditions are still present, there was a reasonable attempt to address the violations based on the course content.
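
As one illustration of the kind of model selection and diagnostic work described above, here is a minimal sketch in base R. It assumes a linear regression setting and uses the built-in `mtcars` data purely as a stand-in. The variables, the choice of a nested F-test and AIC, and the use of base R rather than any particular modeling package are all assumptions for illustration, not the required approach.

```r
# Stand-in data: built-in mtcars, with transmission recoded as a labeled factor.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Candidate models: main effects only vs. adding a weight-by-transmission interaction.
main_fit <- lm(mpg ~ wt + am, data = cars)
int_fit  <- lm(mpg ~ wt * am, data = cars)

# Nested F-test: does adding the interaction term improve the fit?
comparison <- anova(main_fit, int_fit)
print(comparison)

# AIC for both candidates (lower is better).
aic_table <- AIC(main_fit, int_fit)
print(aic_table)

# Residuals vs. fitted values: look for curvature or changing spread.
plot(fitted(int_fit), resid(int_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of residuals: look for strong departures from the line.
qqnorm(resid(int_fit))
qqline(resid(int_fit))
```

In the report itself, describe in words what each comparison and plot led you to decide, rather than only showing the output.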

Results

This is where you will output the final model with any relevant model fit statistics. Describe the key results from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Grading criteria

The model fit is clearly assessed, and interesting findings from the model are clearly described. Interpretations of model coefficients are used to support the key findings and conclusions, rather than merely listing the interpretation of every model coefficient. If the primary modeling objective is prediction, the model’s predictive power is thoroughly assessed.

Discussion

In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Grading criteria

Overall conclusions from analysis are clearly described, and the model results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.

Organization + formatting

This is an assessment of the overall presentation and formatting of the written report.

Grading criteria

The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.
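
One common way to suppress code, warnings, and messages across an R Markdown report is to set global chunk options once. This is a sketch of one option, assuming the report is knitted with knitr. Place it in a setup chunk near the top of the `.Rmd` file.

```r
# Global chunk options: hide code, warnings, and messages in the knitted report.
knitr::opts_chunk$set(
  echo    = FALSE,
  warning = FALSE,
  message = FALSE
)
```

Individual chunks can still override these defaults if a specific piece of output needs to be shown.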