R Markdown: All code used to generate the statistics and plots in your presentation should be submitted in an R Markdown document, organized as outlined below. Download the Project Stage 2 template from the Sakai Resources folder. There is no limit on the length of this document.
Introduction: The goal of your final project and presentation (i.e., Stage 2) is to tell a compelling story based on the data analysis you performed. Introduce the overarching theme or idea that you will be investigating and, within that framework, describe the questions you will be addressing with your analysis. Use your research from Stage 1 to provide background on your topic and to motivate why your analysis is relevant.
Analysis: The data analysis portion of your project will consist of a hypothesis test, an exploratory data analysis, and a multivariate linear model. You should use your research questions from Stage 1 to help you choose which variables you’ll use for each portion of the analysis. However, you are not required to answer all of your research questions from the previous part of the project.
Hypothesis Test: Use the inference() function from the labs to perform one of the hypothesis tests from Units 4 or 5. If the R output includes a confidence interval, interpret it along with the results of your test. If you conduct an ANOVA test, include and interpret the results of the post-hoc pairwise tests. Make sure you address whether the necessary conditions for your inference are met. If they aren’t satisfied, still proceed with the test, but note this when you state your findings.
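As a rough illustration of the ANOVA workflow described above (check conditions, run the overall test, then run the post-hoc pairwise tests), the sketch below uses base R functions rather than the inference() function from the labs, which your project should still use. The built-in PlantGrowth data set and the Bonferroni correction are assumptions chosen only for this example.

```r
# Illustration only: PlantGrowth (built into R) stands in for your project data.
# Base R functions are used here; the project itself asks for inference().

# Check conditions: group sizes and roughly equal spread across groups
group.n  <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
group.sd <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
print(rbind(n = group.n, sd = round(group.sd, 3)))

# Overall ANOVA: is at least one group mean different from the others?
anova.fit <- aov(weight ~ group, data = PlantGrowth)
print(summary(anova.fit))

# Post-hoc pairwise t-tests, with a Bonferroni-adjusted p-value for each pair
posthoc <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                           p.adjust.method = "bonferroni")
print(posthoc$p.value)
```

Each entry of the printed matrix is the adjusted p-value for one pair of groups, so the interpretation should say which specific pairs differ, not only that some difference exists.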
EDA: Compare two variables in your dataset while controlling for a third (categorical) variable. This is typically done using nested side-by-side boxplots or a color-coded scatterplot. Explain the relationship between the variables and compare it to any relevant EDA you performed in Stage 1 of the project.
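The two plot types mentioned above could be sketched in base R as follows. The built-in mtcars data set and the chosen variables and colors are assumptions for illustration only; substitute your own data and variables.

```r
# Illustration only: mtcars (built into R) stands in for your project data.
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# Color-coded scatterplot: mpg vs. weight, with color showing the number of cylinders
cyl.colors <- c("steelblue", "darkorange", "firebrick")
plot(mpg ~ wt, data = cars, col = cyl.colors[cars$cyl], pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cars$cyl), col = cyl.colors, pch = 19,
       title = "Cylinders")

# Nested side-by-side boxplots: mpg by transmission type within each cylinder group
boxplot(mpg ~ am + cyl, data = cars, col = c("grey85", "grey55"),
        xlab = "Transmission . Cylinders", ylab = "Miles per gallon")
```

In the scatterplot the third variable is shown by color; in the boxplots it is shown by grouping, so the two explanatory factors appear as adjacent boxes within each group.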
Modeling: Develop a multiple linear regression model to predict a numerical variable in the dataset. This model should start with a minimum of 5 explanatory variables, but you are welcome to use more than that. The variables you choose should be related to the research interests stated in the Introduction, and you should provide a substantial discussion of your model’s output. For instance, what do the coefficient values tell you about the relationship between the explanatory variables and the response variable? When you performed your model selection, were you surprised that a particular variable wasn’t included in the model? If your model included the same variables as your hypothesis test or EDA, do you still see the same relationship between the explanatory and response variables?
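To make the discussion of coefficients concrete, here is a minimal sketch of fitting a model with five explanatory variables and pulling out its coefficient table. The mtcars data set and the particular variables are assumptions used only for illustration.

```r
# Illustration only: mtcars stands in for your data, mpg for your response.
fit <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# One row per term: estimate, standard error, t statistic, and p-value
coef.table <- coef(summary(fit))
print(round(coef.table, 4))

# Reading a slope: the estimate for wt is the predicted change in mpg for a
# one-unit increase in wt (1000 lbs), holding the other variables constant.
wt.slope <- coef.table["wt", "Estimate"]
print(wt.slope)
```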
Model Selection and Performance: In order to assess how well your model predicts new data, you will need to set aside some of the observations before fitting your model. Replace “your data” in the code below with the name of your data set and run it. This will create two separate data sets, project.train and project.test. The project.train data set contains 80% of the observations, and will be used to fit your model. The remaining 20% of the observations in project.test will be used to test your model’s predictions.
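project.data <- your data # replace "your data" with the name of your data set
set.seed(1234) # optional, arbitrary seed: makes the random split identical each time the document is compiled
n.obs <- nrow(project.data) # number of observations
train.index <- sample(1:n.obs, floor(.8*n.obs), replace=FALSE) # randomly select 80% of obs for training
project.train <- project.data[train.index,] # Data for training the model
project.test <- project.data[-train.index,] # Data for testing the model's accuracy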
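A quick sanity check of the split can catch mistakes early. The sketch below repeats the same steps on the built-in mtcars data set (an assumption for illustration; the names example.train and example.test are hypothetical) and confirms that the two pieces have the expected sizes and share no rows.

```r
# Illustration only: mtcars (32 rows) stands in for your data set.
set.seed(42)  # arbitrary seed, used only so this example is repeatable
example.data <- mtcars
n.obs <- nrow(example.data)
train.index <- sample(1:n.obs, floor(.8 * n.obs), replace = FALSE)
example.train <- example.data[train.index, ]
example.test  <- example.data[-train.index, ]

# The two pieces should account for every row exactly once
nrow(example.train)   # 25 = floor(0.8 * 32)
nrow(example.test)    # 7
length(intersect(rownames(example.train), rownames(example.test)))  # 0
```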
With the project.train data, start with all of the variables you’ve chosen to be in the model, and perform the backward selection process. Use \(R^{2}_{adj}\) for selection since your model will be used to make predictions. Below is a function that computes the \(R^2_{adj}\) values for a single step of the backward selection process. The function takes two arguments: a data frame containing the response and the explanatory variables for that step, and the name of the response variable in quotation marks.
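single.step.backwards <- function(data, response){
  resp.indx <- which(names(data) == response)  # column position of the response
  y <- data[[resp.indx]]                       # response as a plain vector
  X <- data[, -resp.indx, drop = FALSE]        # explanatory variables only
  n.pred <- ncol(X)
  if(n.pred > 1){
    # Refit the model once per explanatory variable, leaving that variable out
    for(i in 1:n.pred){
      fit <- lm(y ~ ., data = as.data.frame(X[, -i, drop = FALSE]))
      print(paste0("Variable ", names(X)[i], " removed: Adjusted R-squared = ",
                   round(summary(fit)$adj.r.squared, 5)))
    }
  }
  else{
    print("Model only contains one variable.")
  }
}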
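The function above reports the adjusted R-squared after each possible removal, but not the value for the current model that those numbers are compared against. The following sketch shows one way to compute both and apply the usual decision rule; it uses the built-in mtcars data set and an arbitrary set of five variables as stand-ins, so treat it as an illustration rather than part of the required workflow.

```r
# Illustration only: mtcars stands in for project.train, mpg for the response.
full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
adj.full <- summary(full)$adj.r.squared   # baseline that a removal must beat

# Adjusted R-squared after dropping each explanatory variable in turn
vars <- attr(terms(full), "term.labels")
adj.drop <- sapply(vars, function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  summary(reduced)$adj.r.squared
})
print(round(c(full = adj.full, adj.drop), 5))

# Decision rule for one step: drop the variable whose removal gives the largest
# adjusted R-squared, but only if that value exceeds the baseline; otherwise stop.
if (max(adj.drop) > adj.full) {
  message("Remove: ", names(which.max(adj.drop)))
} else {
  message("Stop: no removal improves adjusted R-squared")
}
```

Repeating this step on the reduced set of variables until the rule says to stop yields the final model.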
You can use this single.step.backwards() function on the whole dataset or on just the explanatory variables that you are interested in. For example, if you would like to run a single step of backwards elimination on the whole ames dataset, where your response variable is ‘Lot.Area’, you could use the code below.
single.step.backwards(project.train,'Lot.Area')
Alternatively, if you would like to run backwards elimination on the ames data set, where your response variable is ‘Lot.Area’ and you only wish to consider five explanatory variables (MS.SubClass, Street, Lot.Shape, Lot.Config, Roof.Style), you could use the code below. Note that the response variable, Lot.Area, must always be included in the select(). If the rules of backwards elimination suggested that you remove one of the explanatory variables (Street, for example), you would run the single.step.backwards() function again with that variable removed from the select().
library(dplyr) # provides %>% and select()
single.step.backwards(project.train %>% select(MS.SubClass, Lot.Area, Street, Lot.Shape, Lot.Config, Roof.Style), 'Lot.Area')
After you’ve selected the best model, fit it using the project.train data and save it as “model.best”. (The code below gives you an example of how you might do this. Replace the variables in the lm() function with your response variable and the explanatory variables that were left over from the backwards elimination you performed.) Remember to create the necessary diagnostic plots for this model and determine if a linear model is appropriate. If you find your residuals are heavily skewed, fit another model replacing your response variable with the log-transformed response (you don’t need to redo the backwards selection). Is there an improvement in the diagnostic plots?
model.best <- lm(response_variable ~ left_over_expl_variable1 + left_over_expl_variable2 + ..., data=project.train)
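One possible way to produce the diagnostic plots and compare them with those of a log-transformed model is sketched below. The mtcars data set, the variables, and the model names model.example and model.log are assumptions for illustration; the specific plots expected for your project are the ones covered in the labs.

```r
# Illustration only: mtcars stands in for project.train, mpg for the response.
model.example <- lm(mpg ~ wt + hp, data = mtcars)

# Three common diagnostics: residuals vs. fitted values, histogram, normal Q-Q plot
par(mfrow = c(1, 3))
plot(fitted(model.example), residuals(model.example),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
hist(residuals(model.example), main = "", xlab = "Residuals")
qqnorm(residuals(model.example)); qqline(residuals(model.example))

# Same explanatory variables, log-transformed response
model.log <- lm(log(mpg) ~ wt + hp, data = mtcars)
plot(fitted(model.log), residuals(model.log),
     xlab = "Fitted values (log scale)", ylab = "Residuals")
abline(h = 0, lty = 2)
hist(residuals(model.log), main = "", xlab = "Residuals")
qqnorm(residuals(model.log)); qqline(residuals(model.log))
par(mfrow = c(1, 1))
```

Comparing the two rows of plots side by side shows whether the transformation reduced skew in the residuals or evened out their spread.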
Prediction: To assess how accurate your model’s predictions are, compute the Root Mean Square Error (RMSE) for your testing data, project.test. The RMSE measures the typical size of the difference between your model’s predictions \(\hat{y}_{i}\) and the actual observed values \(y_{i}\) over the \(n\) observations in the test data, and is given by the formula \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}.\] To compute the RMSE for your test data, replace “response_var” with the name of your response variable in the following code.
predictions.test <- predict(model.best,project.test)
y <- project.test$response_var
mse <- mean((y-predictions.test)^2)
(rmse <- sqrt(mse))
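If your final model uses a log-transformed response, predict() returns values on the log scale. One common approach, sketched here under that assumption, is to exponentiate the predictions so the RMSE is in the original units of the response. The mtcars data set and all object names are stand-ins for illustration.

```r
# Illustration only: mtcars stands in for your data, mpg for your response.
set.seed(1)  # arbitrary seed, used only so this example is repeatable
train.index <- sample(1:nrow(mtcars), floor(.8 * nrow(mtcars)))
example.train <- mtcars[train.index, ]
example.test  <- mtcars[-train.index, ]

# Model fit on the log scale
model.log <- lm(log(mpg) ~ wt + hp, data = example.train)

# Predictions come back on the log scale; exp() returns them to the original units
predictions.log  <- predict(model.log, example.test)
predictions.test <- exp(predictions.log)

# RMSE computed against the untransformed observed values
y <- example.test$mpg
(rmse <- sqrt(mean((y - predictions.test)^2)))
```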
Lastly, choose a point from the test dataset and use the predict function to calculate the predicted value and the corresponding prediction interval.
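A minimal sketch of this last step is shown below, using predict() with interval = "prediction". The mtcars data set, the 95% level, and the object names are assumptions for illustration; use model.best and a row of project.test in your own work.

```r
# Illustration only: mtcars stands in for your data, mpg for your response.
set.seed(1)  # arbitrary seed, used only so this example is repeatable
train.index <- sample(1:nrow(mtcars), floor(.8 * nrow(mtcars)))
example.train <- mtcars[train.index, ]
example.test  <- mtcars[-train.index, ]
model.example <- lm(mpg ~ wt + hp, data = example.train)

# Pick one observation from the test data
new.point <- example.test[1, ]

# Predicted value (fit) with the lower and upper bounds of a 95% prediction interval
pred <- predict(model.example, newdata = new.point,
                interval = "prediction", level = 0.95)
print(pred)

# Observed value for the same point, for comparison
new.point$mpg
```

The result has three columns, fit, lwr, and upr; the interval describes where a single new observation with these explanatory values is expected to fall.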
Conclusion: Tie together the findings of your analysis with the ideas in your Introduction. Which questions were you able to answer, and what were your conclusions? Were there shortcomings in your data that prevented you from fully answering a particular question? If so, how could future studies fix this problem? How might the results of your analysis be used to motivate or inform future research?
Presentation: Your group should put together a short PowerPoint-like presentation of your analysis to be presented during the final lab session. Keep your presentation under 10 minutes. You will be asked to stop once the time limit is reached. Your presentation needs to highlight each component of your project: hypothesis test, EDA, model, and prediction. However, you should focus on your results and conclusions, and leave the details for your R Markdown file.
The Project Stage 2 and presentation will make up 75% of your overall project score. Stage 1 was worth 25%.
Grading of the project will take into account:
Submit the following documents on Sakai:
Only one submission per team is required. We will download your R Markdown file and run your code to confirm your work is reproducible. Grading will be based on the document we compile, so make sure that your R Markdown file contains everything necessary to compile your entire work.
All team members must be present at the presentation session. Failure to do so will result in a 0 on the project for the absent team member.
Note that each student must complete the project and score at least 30% of the total possible points on the project in order to pass this class.
With the exception of me and the TAs, you should not share your project code with anyone outside of your team. Failure to abide by this policy will result in a 0 for all teams involved.
This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.
The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show that you are proficient at using R at a basic level and at interpreting and presenting the results.
You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.