In this lab, you will put together everything you’ve learned thus far. Unlike previous lab assignments, your lab write-up will take the form of a short report rather than numbered exercises. Though this analysis will not be as in-depth as the one in your final project, the assignment will give your group practice organizing the results of a statistical analysis into a complete narrative.
You will also practice imputing missing data and using k-fold cross validation to assess your model’s performance on test data.
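As a rough illustration of what k-fold cross validation involves, the sketch below splits a dataset into k folds, fits a model on k - 1 of them, and measures prediction error on the held-out fold. It uses R's built-in mtcars data and an arbitrary formula as stand-ins; neither is the lab's data or model, and the choice of k = 5 and RMSE as the error measure are assumptions made for the example.

```r
# Sketch of 5-fold cross validation for a linear model (base R only).
# mtcars and the formula mpg ~ wt + hp are stand-ins, not the lab's data or model.
set.seed(210)                      # make the random fold assignment reproducible
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of (nearly) equal size
folds <- sample(rep(1:k, length.out = n))

fold_rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]    # fit on the other k - 1 folds
  test  <- mtcars[folds == i, ]    # hold out fold i
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  fold_rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

cv_rmse <- mean(fold_rmse)         # average test error across the k folds
cv_rmse
```

Every observation is used for testing exactly once, so the averaged error gives a less optimistic picture of performance than the error computed on the data used to fit the model.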
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-09-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in RStudio Cloud.
Configure Git by typing the following in the console. Be sure to use the email address that is associated with your GitHub account.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
If you would like your git password cached for a week for this project, type the following in the Terminal:
git config --global credential.helper 'cache --timeout 604800'
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
You will need the following packages for today’s lab:
library(tidyverse)
library(dslabs)
# Fill in other packages as needed
Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to GitHub.
The remaining team members, who have not been making these changes in their own projects, should now click the Pull button in their Git pane and confirm that the changes appear in their projects as well.
The data for this lab is the gapminder dataset in the dslabs package. This dataset contains health and income data for 184 countries during the years 1960 to 2016. After loading the dslabs package, you can type ?gapminder in the console to see the variables in the dataset.
You will only use data from 2011 in this lab.
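One way to restrict the data to 2011 and get a first look at missingness is sketched below. The object name gapminder_2011 is a hypothetical choice for this example, not something the starter documents require.

```r
library(tidyverse)
library(dslabs)

# Keep only the 2011 observations; gapminder_2011 is a hypothetical name
gapminder_2011 <- gapminder %>%
  filter(year == 2011)

# Count missing values in each column to see which variables may need imputation
gapminder_2011 %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```

The resulting one-row summary shows, for each variable, how many countries have no recorded value in 2011.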
The goal of this analysis is to build a regression model that could be used to predict a country’s gross domestic product (gdp) using the other characteristics included in the data.
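A minimal sketch of how imputation and model fitting might fit together is shown below. Median imputation, the choice to impute infant_mortality, the predictors in the formula, and the object names gapminder_imputed and gdp_model are all assumptions made for illustration; your group should choose an imputation approach and predictors that suit your own analysis.

```r
library(tidyverse)
library(dslabs)

gapminder_2011 <- gapminder %>%
  filter(year == 2011)

# Median imputation for one predictor (an assumed choice, not a required method)
gapminder_imputed <- gapminder_2011 %>%
  mutate(infant_mortality = if_else(
    is.na(infant_mortality),
    median(infant_mortality, na.rm = TRUE),
    infant_mortality
  ))

# Illustrative model only; lm() drops rows where gdp itself is missing
gdp_model <- lm(gdp ~ life_expectancy + infant_mortality + population,
                data = gapminder_imputed)
summary(gdp_model)
```

Only the predictor is imputed here. Rows with a missing response are left out of the fit by lm(), so the number of observations used may be smaller than the number of countries in the 2011 data.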
Introduction
Brief introduction of the data and the research question
Exploratory Data Analysis
At a minimum, your exploratory data analysis should include the following:
Regression Model
At a minimum, the discussion for the final regression model should include the following:
Assumptions
At a minimum, the discussion of model assumptions should include the following:
Model Validation
At a minimum, the discussion of the model validation should include the following:
Conclusion
Brief summary of the conclusions drawn from the analysis.