The goal of the final project is to apply what you’ve learned in this course to conduct a statistical analysis. It should be an in-depth regression analysis of a question that interests your group. This question may come from one of your other courses, your research interests, your future career interests, etc.

The project will consist of

  • Proposal: due Thu, March 5 at 11:59p EDT
  • Analysis: due Thu, April 16 at 11:59p EDT
  • Presentation: due Mon, April 27 at 11:59p EDT
    • Watch presentations: Tue, April 28 - Wed, April 29
  • Final write up: due Wed, Apr 29 at 11:59p

Data

It is best to start with the question of interest and finding the data second. As you’re looking for data, keep in mind your regression analysis must be done using R. Once you find a data set, you should make sure you are able to load it into RStudio, especially if it is in a format we haven’t used in class before. If you’re having trouble loading your data set into RStudio, ask for help as soon as possible, so you can make any necessary adjustments before the project proposal is due.

In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple main effects and interactions can be explored for your model. Your dataset must have at least 100 observations and at least 10 variables (exceptions can be made but you must speak with me first). The data set should include both quantitative and categorical variables.

You are not permitted to reuse datasets used in examples/homework/labs in class.

Help finding data

The Data Visualization Services team (located in Bostock library) has written a guide for finding data for a regression analysis. Please visit R Data Resources for Regression Analysis for guidelines to consider as you search for data along with suggestions for potential data sources.

Other sources that may be helpful:

Components

Proposal (10 pt): due Thu, March 5 at 11:59p

There are two main purposes of the project proposal:

  • To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
  • To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will help you be successful for this project.

You will address the following in your proposal:

Narrative

  • What is the research question you wish to explore? This includes an introduction to the subject matter you’re investigating, the motivation for the for your research question (citing any relevant literature), and your hypotheses regarding the research question of interest.
  • Describe the dataset you wish to explore. This includes describing the observations in the data, a general description of the variables, and how the data was originally collected (not how you found the data but how the original curator of the data collected it.)

The Data

  • Use the glimpse function print a summary of the data frame at the end of your proposal.
  • Place your data in the /data folder, and add dimensions (number of observations and variables) and the data dictionary (a description of every variable in the dataset) to the README in the folder. Make sure the data dictionary is neatly formatted and easy to read. If your dataset has a lot of variables (> 20), you can include 10 of the key variables in the data dictionary in the README with a link to the original data dictionary for all of the variables.

Submission

To give you some experience with the type of workflow you’ll experience in practice, you will submit the proposal and all other aspects of the project in your GitHub repo. You do not need to submit anything for the project on Gradescope. We will provide comment and feedback as an “issue” in your GitHub repo.

Grading

Total 10 pt
Narrative 5 pt
Data (glimpse & dataset in data folder) 1 pt
Data dictionary 2 pt
Organization and formatting 2 pt

Analysis (20 pt): due Thu, April 16 at 11:59p EDT

For this portion of the project, you will conduct the regression analysis and begin to derive your final conclusions. This will become the main portion of your final write up. The regression analysis should include the following:

  • Statement of the research question and modeling objective (prediction, inference, etc.)
  • Description of the data and response variable
  • Exploratory data analysis - univariate, bivariate, multivariate (i.e. visualizing potential interaction terms)
    • The EDA should include visualizations (with informative titles and axes labels) and summary statistics as you describe the main results from the visualizations.
  • Brief explanation of the modeling approach and why you chose the model type. This should include the model type, model selection procedure, and interaction terms you’re considering for the final model.
  • Output of the final model. Note: The final model will be the result of numerous iterations and trying different models. You can include the other models you consider in the “Additional work” section at the end of the analysis write up.
  • Discussion of the assumptions for the final model
  • Interpretations and interesting findings from the model coefficients
  • Additional work of other models or analysis not included in the final model.

The regression analysis should be written as a report. Write your analysis in a .Rmd file called “analysis.Rmd” in your project folder on GitHub and output a PDF. Use proposal.Rmd to see an example YAML.

Presentation (20 pt): due Mon, April 27 at 11:59p EDT

In lieu of in-class presentations, we will share presentations on Sakai. A few of your classmates, your TAs, and Professor Tackett will watch your presentation and post questions.

The presentation will consist of two components: (1) Slide deck and (2) Video presentation OR write up.

Slide deck

Create a slide deck presenting the main findings of your analysis. The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

  • Title Slide
  • Slide 1: Introduce the topic and motivation
  • Slide 2: Introduce the data
  • Slide 3: Highlights from EDA
  • Slide 4: Final model
  • Slide 5: Interesting findings from the model
  • Slide 6: Conclusions + future work

You can use the software of your choice to create your slide deck. Save your slide deck as PDF or provide a link to view your slides online (e.g. in Google Slides). Be sure you grant the correct access permissions, so Professor Tackett, the TAs, and your classmates have access to your slides.

Video or Write up

You will create a video presentation OR write-up to accompany your slide deck. One objective of the project is to give you experience presenting a statistical analysis through writing and in an oral presentation. Therefore, I strongly encourage your group to create a video presentation, if possible.

Video presentation

The video should be no more than 10 minutes (most presentations are about 8 - 10 minutes). For the presentation, you can speak over your slide deck, similar to the lecture content videos. I recommend using Zoom to record your presentation; however, you can use whatever platform works best for your group. Below are a few resources to help you record video presentations:

You will post the presentation video in Warpwire, which is accessible from the the course Sakai site (bottom of the left-hand tool bar). To post your video on Warpwire:

  • Click the Warpwire tab in the course Sakai site.
  • Click the “+” and select “Upload files”.
  • Locate the video on your computer and click to upload.
  • Once you’ve uploaded the video to Warpwire, click to share the video and make a copy of the video’s URL. You will need this when you post the video in the discussion forum.

Presentation write up

If it is not feasible for your group to make a video, you can make a write up (up to two pages singled- spaced) to accompany your slide deck. Think of the write up as a transcript of what you would say if you were presenting your results in class or in a video. Save your write up as a PDF. You will upload this write up with your slide deck to the discussion forum in Sakai.

Posting your presentation

Each team will post their slide deck + video or write up in the discussion forum in Sakai. To post your presentation:

  • Click on the Recordings & Discussion tab in the course Sakai site.
  • Click on the topic for your lab section.
  • Click on the conversation with your team name. Then, click to edit the conversation.
  • If you have a video presentation:
    • Click the Warpwire icon (between the flag and shopping cart icons).
    • Select your video, then click “Insert 1 item.” This will embed your video in the conversation.
    • Under the video, paste the URL to your video.
    • You’re done! You do not need to post your slide deck.
  • If you have a write up:
    • Click “Add Attachments”.
    • Add your slide deck and your write up.
    • You’re done!

The presentation must be posted on Sakai by Monday, April 27 at 11:59p EDT.

Presentation Grading

The presentation will be graded based on the following:

  • Slide deck (7 pt): The slides are neatly organized, professional, and highlight the main aspects of the project. The slides are easy to read and do not contain excessive text. There are no more than 6 content slides.
  • Video or write up (7 pt): The video or write up is professional, organized, and presents a clear narrative. The video is no longer than 10 minutes or the write up is no longer than two pages single-spaced.
  • Content (6 pt): The analysis content covered in the slides + video or write up is clear and accurate.

Watch + comment on presentations (10 pt): Tue, April 28 - Wed, April 29

Each student will be assigned 3 presentations to watch. After watching the group’s video or reading their slide deck and write up, click “Reply” to post a question for the group. You may not post a question that’s already been asked on the discussion thread. Additionally, the question should be (i) substantive (i.e. it can’t be “Why did you use a histogram instead of box plot”?), (ii) demonstrate your understanding of the content from the course, and (iii) relevant to that group’s specific presentation, i.e demonstrating that you’ve watched the presentation.

You may start posting questions at Tue, April 28 at 12a EDT. All questions must be posted by Wed, April 29 at 11:59p EDT.

This portion of the project will be assessed individually.

Final Write Up (35 pt): due Wed, Apr 29 at 11:59p

The goal of the final write up is to demonstrate your ability to use regression analysis to answer meaningful questions, your proficiency in R, and your proficiency interpreting and presenting results.

The final write up should be no more than 15 pages. This does not include the Additional Work section.

The final write up should be written as it if will be read by a business or research colleague interested in understanding the main results from your analysis. This means your write up should focus on the main conclusions and interesting findings that you derive from your analysis. It should not just be a list of every model you tried and interpretation of every model coefficient.

You can use the following sections to help organize your write up:

  • Section 1: Introduction
    • This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
  • Section 2: Regression Analysis
    • This section includes a brief description of your modeling process. Describe how you chose the modeling approach, how you conducted model selection, interactions you considered, and any variable transformations. This is also where you will output the final model and include a brief discussion of the model assumptions, diagnostics, and any model fit statistics (e.g. \(R^2\), AUC, etc.)
  • Section 3: Discussion
    • This section includes any relevant prediction and/or conclusions from your model. This should not just be a list of coefficient interpretations but rather use the interpretations from the model to support your conclusions. Remember, you are sharing a narrative with a business or research colleague who could potentially use your model to make decisions. They want to understand the practical conclusions and implications of the model results.
  • Section 4: Limitations
    • This section includes discussion about issues pertaining to the reliability and validity of your data or appropriateness of the regression analysis. This can be 1 - 2 paragraphs on what you would do differently if you were able to redo the project or next steps if you could continue working on the project.
  • Section 5: Conclusion
    • This section includes a summary of your key results any final points you want the reader to learn. It can also include ideas about future work.
  • Section 6: Additional Work
    • This section includes anything that is not included in the main body of the paper. This could be additional EDA, other models you’ve tried, additional analysis, etc. There is no page limit on the additional work, but it should still be neatly organized and easy for the reader to navigate.

Put the final write up in a .Rmd file called “final-writeup.Rmd” in your project folder on GitHub, and output it as a PDF. The document should be neatly organized, and all code and warning messages should be suppressed, i.e. not visible in the PDF. See Formatting Guidelines & Tips for code and tips to help you format your write up.

Final Write up Grading

The final write up will be graded based on the following:

  • Analysis (15 pt): The analysis steps are appropriate for the data and research question. The group used a thoughtful approach to select the final model that took into account potential interaction effect and addressed violations in assumptions. The model assumptions and diagnostics are thoroughly and accurately assessed. If violations of model assumptions still exist, there was a reasonable attempt to address them, i.e. based on what we’ve learned this semester.

  • Discussion (10 pt): The model fit is clearly assessed, and interesting findings from the model are clearly described. Interpretations of model coefficients are used to support the key findings and conclusions. If the primary modeling objective is prediction, the model’s predictive power is assessed.

  • Limitations & Conclusion ( 5 pt): Overall conclusions from analysis are clearly described. The group has thoughtfully considered potential limitations of their data or analysis and presented potential ideas to address them.

  • Organization & Formatting (5 pt) : The final write up is neatly organized with clear section headers and appropriately sized figures with informative labels. All code, warnings, and messages are suppressed. Overall, the document would be presentable in a business or research setting. See Formatting Guidelines & Tips for code and tips to format your document.

Team peer evaluation (5 pt) due Fri May 1

You will be asked to fill out a survey where you rate the contribution and teamwork of each team member out of 5 points. You will additionally report a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than 20% of the work, please provide some explanation. If any individual gets an average peer score indicating that they did less than 10% of the work, this person’s project grade will be assessed accordingly.

Overall Grading

Total 100 pt
Proposal 10 pt
Analysis 20 pt
Final Write up 35 pt
Presentation 20 pt
Watch + comment on presentations 10 pt
Team peer evaluation 5 pt

The project will be graded based on the following criteria:

  • Consistency: Did you clearly answer the question of interest?
  • Clarity: Can the audience easily understand your analysis process and any sort of conclusions/arguments you make?
  • Relevancy: Did you use the appropriate statistical techniques to address your question? Was your analysis thorough (e.g. did you consider interactions in addition to main effects?)?
  • Interest: Did you attempt to answer a challenging and interesting question rather than just calculating a lot of descriptive statistics and simple linear regression models?
  • Organization: Is your write up and presentation organized in a way that is neat and clear for the audience to understand?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Late penalty:

  • Late, but within 24 hours of due date/time: -20% (only applies to written portion, there is no option to do your presentation later)
  • Any later: no credit

Getting started

  • Go to the course organization on GitHub: https://www.github.com/sta210-sp20

  • Find the repo starting with project and that has your team name at the end (this should be the only project repo available to you).

  • Follow the usual instructions for cloning a new project in RStudio

Tips

  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.

  • Review the grading guidelines and ask questions if any of the expectations are unclear.

  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).

  • Set aside time to work together both in the same location and remotely.

  • When you’re done, review the .md document on GitHub to make sure you’re happy with the final state of your work.

  • Code: In your write up your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However your document should include all your code such that if I re-knit your Rmd file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcomed to show that portion.

  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment and group assessments will be given at its completion - anyone judged to not have sufficient contributed to the final product will have their grade penalized. While different teams members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment works.