In this lab, you will put together everything you’ve learned thus far. Unlike previous lab assignments, your lab write up will be in the form of a small report (rather than numbered exercises). Though this analysis will not be as in-depth as your analysis in the final project, this assignment will give your group practice organizing the results of a statistical analysis to tell a complete narrative.

You will also practice imputing missing data and using k-fold cross validation to assess your model’s performance on test data.

Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-09-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.

⊕When configuring Git, be sure to use the email address that is associated with your GitHub account.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(dslabs)
# Fill in other packages as needed

Project name:

Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to Github.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.

Data

The data for this lab is the gapminder dataset in the dslabs package. This dataset contains health and income data for 184 countries during the years 1960 to 2016. After loading the dslabs package, you can type ?gapminder in the console to to see the variables in the dataset.

You will only use data from 2011 in this lab.

Exercises

The goal of this analysis is to build a regression model that could be used to predict a country’s gross domestic product (gdp) using the other characteristics included in the data.

Introduction

Brief introduction of the data and the research question

Exploratory Data Analysis

At a minimum, your exploratory data analysis should include the following:

Analysis of each variable
Dealing with missing values using imputation methods
Analysis of the relationships between variables
Discussion of any potential transformations, if needed

Regression Model

At a minimum, the discussion for the final regression model should include the following:

Brief discussion about the type of model you used (multiple linear regression, logistic, multinomial logistic regression) and why
Discussion of any transformations on the response and/or explanatory variables, if applicable
Display of the final model
Test of interesting interactions
Conclusions drawn from the model, including any interesting insights based on the model coefficients

Assumptions

At a minimum, the discussion of model assumptions should include the following:

Appropriate residual plots
Check for influential points
Check for multicollinearity
Discussion of whether or not assumptions are met and how any issues may affect conclusions drawn from the model

Model Validation

At a minimum, the discussion of the model validation should include the following:

Results and discussion from a 5-fold cross validation

Conclusion

Brief summary of the conclusions drawn from the analysis.

Lab 09: Putting it All Together

due Wed, Apr 10 at 11:59p