The goal of today’s lab is to practice forward and backward model selection. In addition to practice with model selection functions in R, you will manually conduct a backward selection procedure to better understand what occurs when you use model selection functions.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-06-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
When configuring Git, be sure to use the email address that is associated with your GitHub account.
If you would like your git password cached for a week for this project, type the following in the Terminal:
git config --global credential.helper 'cache --timeout 604800'
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
You will need the following packages for today’s lab:
library(tidyverse)
library(knitr)
library(broom)
library(leaps)
library(Sleuth3) #case1201 data
library(ISLR) #Hitters data
Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to Github.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
There are two datasets for this lab.
The dataset for Part I contains the SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 states in the U.S. The data is based on test takers for the 1982 exam. The following variables are in the dataset:
SAT
: average total SAT scoreState
: U.S. StateTakers
: percentage of high school seniors who took examIncome
: median income of families of test-takers ($ hundreds)Years
: average number of years test-takers had formal education in social sciences, natural sciences, and humanitiesPublic
: percentage of test-takers who attended public high schoolsExpend
: total state expenditure on high schools ($ hundreds per student)Rank
: median percentile rank of test-takers within their high school classesThe dataset for Part II contains the performance statistics and salaries of Major League Baseball players in the 1986 and 1987 seasons. The data is in the Hitters
dataset in the ISLR
package. Type ?Hitters
in the console to see the variables names and their definitions.
For the first part of the lab, you will return to the model selection activity you started in class using the SAT data. The data is in the case1201
data frame in the Sleuth3
package.
sat_scores <- case1201 %>%
select(-State)
full_model <- lm(SAT ~ ., data = sat_scores)
m1 <- lm(SAT ~ Income + Years + Public + Expend + Rank, data = sat_scores)
m2 <- lm(SAT ~ Takers + Years + Public + Expend + Rank, data = sat_scores)
m3 <- lm(SAT ~ Takers + Income + Public + Expend + Rank, data = sat_scores)
m4 <- lm(SAT ~ Takers + Income + Years + Expend + Rank, data = sat_scores)
m5 <- lm(SAT ~ Takers + Income + Years + Public + Rank, data = sat_scores)
m6 <- lm(SAT ~ Takers + Income + Years + Public + Expend, data = sat_scores)
What is the best 5-variable model? Display the model output.
Use the regsubsets
function to perform backward selection. What is the final model when \(Adj. R^2\) is the selection criterion? Display the coefficients and the \(Adj. R^2\) of the final model. This should be the same result you got in Exercise 1.
What is the final model when \(BIC\) is the selection criterion? Display the coefficients and the \(BIC\) of the final model.
Consider the comparisons made in the previous exercise. Are these differences what you would expect given the selection criteria used? Briefly explain.
The data for this part of the lab is the Hitters
dataset in the ISLR
package. Your goal is to fit a regression model that uses the performance statistics of baseball players to predictor their salary. There are 19 potential predictor variables, so you will use the regsubsets
function to conduct forward selection to choose a final model.
Read through the data dictionary for the Hitters
dataset. You can access it by typing ?Hitters
in the console. What is the difference between the variables HmRun
and CHmRun
?
Some observations have missing values for Salary
. Filter the data, so only observations that have values for Salary
are included. You will use this filtered data for the remainder of the lab.
Fill in the code below to conduct forward selection and save the results in an object called sel_summary
(selection summary).
The nvmax
option indicates the maximum-sized variable subsets to consider in the model selection.
regfit_forward <- regsubsets(_______, ________, method="forward", nvmax = 19)
sel_summary <- summary(_______)
sel_summary
contains the summary statistics for the best fit model containing \(k\) predictors, where \(k = 1, \ldots, 19\). The object sel_summary
is a list object, so it is cumbersome to extract the relevant summary statistics. Therefore, you can create a data frame called summary_stats
such that each row represents the best fit model with \(k\) predictors and each column is a summary statistic. For example, the second row contains the summary statistics of the best fit model that contains 2 variables.Fill in the code below to create the data frame summary_stats
that includes the \(BIC\), \(R^2\), \(Adj. R^2\), and residual sum of squares (RSS) for each model in sel_summary
. The data frame summary_stats
will also include the column np
, the number of predictors in the model represented on each row.
summary_stats <- data.frame(bic = sel_summary$bic,
adjr2 = _______,
rsq = _______,
rss = _______) %>%
mutate(np = row_number()) #number of variables
See the ggplot2 documentation for code to add a vertical line.
summary_stats
data frame to plot \(BIC\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the overall final model you would select based on \(BIC\). Be sure your plot has clear and informative title and axes labels.
You can fill in the code below with either max
or min
to find the number of predictors in the final model selected based on \(BIC\).
np_bic <- summary_stats %>%
filter(bic == _____(bic)) %>%
select(np) %>%
pull()
summary_stats
data frame to plot \(Adj. R^2\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the final model you would select based on \(Adj. R^2\). Be sure your plot has clear and informative title and axes labels.
summary_stats
data frame to plot \(R^2\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the final model selected based on \(R^2\). Be sure your plot has clear and informative title and axes labels.
Should \(R^2\) be used as a model selection criterion? Briefly explain why or why not using your answers to Exercises 11 - 13.
Choose a final model to predict a baseball player’s Salary
from his performance statistics. Display the variables, their coefficients, and the summary statistics from the summary_stats
data frame for this model.
Part II of this lab was inspired by Lab 6.5 in An Introduction to Statistical Learning and Variable Selection in Regression.