Data
For your project, pick one of the following datasets to use for your analysis.
- Ames Housing: All residential home sales in Ames, Iowa between 2006 and 2010.
- Body Measurements: Body girth measurements and skeletal diameter measurements for 507 physically active individuals.
- Pokemon: Data on 75 Pokémon evolutions spread across four species.
- U.S. Counties: Data for 3083 counties in the United States.
To load your dataset, first download the ‘project.zip’ file in the Resources section of Sakai. Next, in RStudio, click ‘Upload’ in the file explorer and upload your data and markdown files. Navigate to where your files are saved using the file explorer and then select ‘Session’ -> ‘Set Working Directory’ -> ‘To Files Pane Location’. Use the load() function to load your data in RStudio. For example
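A minimal sketch of the loading step. The file name `example.RData` and the object `example.data` are hypothetical stand-ins (the sketch writes its own small file so that it runs anywhere); in your project, pass the name of the data file from ‘project.zip’ to load() instead.

```r
# Hypothetical file name; substitute the data file from project.zip.
data.file <- file.path(tempdir(), "example.RData")

# Create a small stand-in file so that this sketch runs on its own.
example.data <- data.frame(x = 1:3, y = c(2.5, 4.1, 6.3))
save(example.data, file = data.file)
rm(example.data)

# load() restores the saved objects into the workspace and
# returns their names as a character vector.
loaded.names <- load(data.file)
print(loaded.names)
str(example.data)
```

Printing the value returned by load() is a quick way to find out what the dataset object is called once it is in your workspace.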
Stages of the project
You will complete this project in two stages:
- Stage 1: Proposal (25 points)
- Stage 2: Final presentation (75 points)
The remainder of this document outlines the requirements and expectations for both stages of the project. You should read the entire document before getting started. The requirements and expectations for Stage 1 will only make sense in context of those for Stage 2.
Stage 1: Proposal (25 points)
Content
Your proposal should contain the following:
Title: (1 point) Choose an appropriate title for your project.
Data: (3 points) Describe your dataset and discuss your motivation for choosing it.
Research questions: (6 points) Come up with three research questions that you want to answer using your data. You should phrase your research questions in a way that matches up with the scope of inference that is allowed by your data. Two of your questions should involve at least three variables. These questions can be based on the existing variables, but you are also free to create new variables from the data. You will have the option to update / revise / change these questions when doing Stage 2 of the project.
EDA: (9 points) Perform an exploratory data analysis that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Timeline: (3 points) Sketch out a timeline for the work you will do to complete this project. Be as detailed and precise as possible. And be realistic – discuss course schedules, travel plans, etc.
Teamwork: (3 points) Describe in detail how you will divide the work between team members and what aspects of the project you will complete together as a team. Note that during the final presentation each member needs to be able to answer questions about all aspects of the work, regardless of whether they took the lead on that section or not.
Grading
Your proposal will be graded out of 25 points (as outlined above), and will make up 25% of your overall project score.
The following will result in deductions:
- Late: -1 point for each day late
- Reproducibility issues that require us to change the R Markdown file in order to knit the document: -3 points
Stage 2: Final presentation (75 points)
Content for the RMD/HTML file
Introduction: Outline your main research question(s) and your motivation for choosing them, and explain how your analyses address these questions. These questions can be updated from Stage 1 and should only focus on the analysis you completed for your project.
EDA: Do some exploratory data analysis to tell an “interesting” story about your dataset. Instead of limiting yourself to relationships between just two variables, broaden the scope of your analysis and employ creative approaches that evaluate relationships between two variables while controlling for another.
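One simple way to look at two variables while controlling for a third is to summarize the relationship separately within each level of the third variable. The sketch below does this with R's built-in mtcars data, which is only a stand-in for your project dataset; the variable names (mpg, wt, cyl) come from mtcars, not from the project data.

```r
# Stand-in data (assumption): mtcars, relating mpg to wt within each cyl group
by.cyl <- split(mtcars, mtcars$cyl)

# Correlation between mpg and wt overall, then separately within each group
overall.cor <- cor(mtcars$mpg, mtcars$wt)
group.cor <- sapply(by.cyl, function(d) cor(d$mpg, d$wt))

print(round(overall.cor, 3))
print(round(group.cor, 3))
```

Comparing the overall correlation with the within-group values shows how much of the apparent relationship is tied to the grouping variable. A conditioning plot such as coplot() can present the same comparison visually.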
Inference: Use one of your research questions (or come up with a new one depending on feedback from the proposal) that can be answered with a hypothesis test or a confidence interval. This question could be used to shed some light on your choice of the ‘best’ linear model. Carry out the appropriate inference task to answer your question.
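As an illustration of one possible inference task, the sketch below compares the mean of a numerical variable between two groups. It uses the built-in mtcars data as a stand-in, and the choice of a two-sample t-test is an assumption made for the example; the appropriate procedure depends on your own research question and variables. Note that t.test() uses Welch's unequal-variance version by default.

```r
# Stand-in data (assumption): compare mean mpg for automatic (am = 0)
# and manual (am = 1) cars in mtcars
test.result <- t.test(mpg ~ am, data = mtcars)

print(test.result$p.value)   # p-value for H0: the two group means are equal
print(test.result$conf.int)  # 95% confidence interval for the difference in means
```

The same object holds both the hypothesis test and the confidence interval, so either can be reported depending on how the research question is phrased.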
Modeling: Develop a multiple linear regression model to predict a numerical variable in the dataset. This model should start with a minimum of 4-5 explanatory variables, but you are welcome to use more than that. To assess how well your model fits the data, you will need to set aside some of the observations to test how your model performs. To do this, replace “your_data” with the name of your dataset in the code below.
set.seed(1) # any fixed seed; keeps the random split the same each time the file is knit
project.data <- your_data # replace your_data with the name of your dataset
n.obs <- nrow(project.data)
train.index <- sample(seq_len(n.obs), floor(0.8 * n.obs), replace = FALSE) # randomly select 80% of obs for training
project.train <- project.data[train.index, ] # Data for training the model
project.test <- project.data[-train.index, ] # Data for testing the model's accuracy
Using the training data (project.train), start with all the variables in the model and use backward selection with adjusted R-squared to find the ‘best’ model. Below is a function that computes the adjusted-\(R^2\) values for a single step of the backward selection process. As arguments, the function takes a data frame containing the variables for that step and the name of the response variable in quotes.
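single.step.backwards <- function(data, response) {
  # Separate the response vector from the predictor columns
  resp.indx <- which(names(data) == response)
  y <- data[[resp.indx]]
  X <- data[, -resp.indx, drop = FALSE]
  n.pred <- ncol(X)
  # Refit the model with each predictor left out in turn and print its adjusted R-squared
  for (i in seq_len(n.pred)) {
    fit <- lm(y ~ ., data = X[, -i, drop = FALSE])
    print(paste0("Variable ", names(X)[i], " removed: Adjusted R-squared = ",
                 round(summary(fit)$adj.r.squared, 5)))
  }
}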
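The values printed for a single step are usually compared with the adjusted R-squared of the current model. In standard backward selection, you remove the variable whose removal gives the largest adjusted R-squared, and stop once no removal improves on the current model. The sketch below walks through one such step on the built-in mtcars data, used here only as a stand-in. The names demo.data and drop.one.adj.r2 are hypothetical, and the sketch stores the values in a vector rather than printing them so that they can be compared in code.

```r
# Stand-in data (assumption): built-in mtcars, with mpg as the response
demo.data <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]

# Adjusted R-squared of the current model, which uses every predictor in demo.data
full.adj.r2 <- summary(lm(mpg ~ ., data = demo.data))$adj.r.squared

# Adjusted R-squared after dropping each predictor in turn
predictors <- setdiff(names(demo.data), "mpg")
drop.one.adj.r2 <- sapply(predictors, function(v) {
  reduced <- demo.data[, names(demo.data) != v, drop = FALSE]
  summary(lm(mpg ~ ., data = reduced))$adj.r.squared
})

print(round(full.adj.r2, 5))
print(round(drop.one.adj.r2, 5))

# Drop the variable whose removal gives the largest adjusted R-squared,
# but only if that value beats the current model; otherwise stop.
best.drop <- names(which.max(drop.one.adj.r2))
if (max(drop.one.adj.r2) > full.adj.r2) {
  message("Remove ", best.drop, " and repeat with the reduced data frame.")
} else {
  message("No removal improves adjusted R-squared; keep the current model.")
}
```

Repeating this step on the reduced data frame until the stopping condition is reached yields the final model.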
Prediction: To assess how accurate your model is, compute the Root Mean Square Error (RMSE) for your testing data (project.test). The RMSE measures the typical size of the difference between your model’s predictions and the actual observed values, and is given by the formula \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}},\] where \(n\) is the number of observations in the test data, \(y_{i}\) is the \(i\)th observed value and \(\hat{y}_{i}\) is the corresponding prediction. To compute the RMSE for your test data, replace “response_var” with the name of the response variable for your dataset in the following code. Here model.best is the final model you fit with lm() on project.train.
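predictions.test <- predict(model.best, newdata = project.test) # predictions for the held-out observations
y <- project.test$response_var # observed values of the response in the test data
mse <- mean((y - predictions.test)^2) # mean squared prediction error
rmse <- sqrt(mse)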
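To see the whole sequence run from start to finish, the sketch below splits the built-in mtcars data (a stand-in for the project data), fits a model on the training rows, and computes the RMSE on the held-out rows. The names demo.data, demo.train, demo.test and demo.model, as well as the seed value, are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the split is repeatable

# Stand-in data (assumption): mtcars with mpg as the response
demo.data <- mtcars[, c("mpg", "wt", "hp", "qsec")]
n.obs <- nrow(demo.data)
train.index <- sample(seq_len(n.obs), floor(0.8 * n.obs), replace = FALSE)
demo.train <- demo.data[train.index, ]
demo.test <- demo.data[-train.index, ]

# Fit on the training rows only, then predict the held-out rows
demo.model <- lm(mpg ~ ., data = demo.train)
predictions.test <- predict(demo.model, newdata = demo.test)

# RMSE: square root of the mean squared prediction error on the test rows
rmse <- sqrt(mean((demo.test$mpg - predictions.test)^2))
print(rmse)
```

The RMSE is in the same units as the response variable, which makes it easier to judge whether the prediction error is large or small in context.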
Lastly, choose a point from the test dataset and use the predict function to calculate the predicted value and the corresponding prediction interval.
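A sketch of this last step, again using the built-in mtcars data as a stand-in: one row is held out as the "test" point, and predict() is called with interval = "prediction". The model formula and the names demo.train and new.point are assumptions made for the example.

```r
# Stand-in data (assumption): hold out the first row of mtcars as the test point
demo.train <- mtcars[-1, ]
new.point <- mtcars[1, ]

demo.model <- lm(mpg ~ wt + hp, data = demo.train)

# interval = "prediction" returns the predicted value (fit) together with the
# lower (lwr) and upper (upr) bounds; level sets the confidence level
pred <- predict(demo.model, newdata = new.point,
                interval = "prediction", level = 0.95)
print(pred)
print(new.point$mpg)  # observed value, for comparison
```

Checking whether the observed value falls inside the interval gives a concrete talking point for the presentation.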
Conclusion: Give a brief summary of your findings from the previous sections, without repeating your earlier statements, and discuss what you have learned about the data and your research question(s). You should also discuss any shortcomings of your current study (either due to data collection or methodology) and include ideas for possible future research.
Grading
Your final project will be graded out of 75 points, and will make up 75% of your overall project score.
Grading of the project will take into account:
- Correctness: Are the procedures and explanations correct?
- Presentation: Are your slides well organized and your results clearly presented?
- Content/Critical thought: Did you think carefully about the problem?
- Tidiness: Is your code well organized?
Your team scores will be based on the following components:
- 25 points - PowerPoint slides
- 25 points - presentation
- 25 points - code
Submission
Due date: 11:55 PM on the night before the presentation day.
Submit online on Sakai under Assignments. Submissions are time-stamped, and the late penalty will be applied based on the time stamp. Only one submission per team is required. Submit the following:
- R Markdown file (.Rmd)
- HTML output (.html)
- PowerPoint slides
We will download your R Markdown file and run your code to confirm reproducibility of your work. Grading will be based on the document we compile, so make sure that your R Markdown file contains everything necessary to compile your entire work.
Teamwork and grading
All team members must be present at the presentation session. Failure to do so will result in a 0 on the project for the absent team member.
Note that each student must complete the project and score at least 30% of total possible points on the project in order to pass this class.
Honor code
You may not discuss this project in any way with anyone outside your team, besides the professor and TAs. Failure to abide by this policy will result in a 0 for all teams involved.
Tips
This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.
The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show that you are proficient at using R at a basic level and that you are proficient at interpreting and presenting the results.
You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.