Lab 06: Model Selection

due Wed, Mar 5 at 11:59p

The goal of today’s lab is to practice forward and backward model selection. In addition to practice with model selection functions in R, you will manually conduct a backward selection procedure to better understand what occurs when you use model selection functions.

Getting Started

library(usethis)
use_git_config(user.name="your name", user.email="your email")

When configuring Git, be sure to use the email address that is associated with your GitHub account.

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(knitr)
library(broom)
library(leaps)
library(Sleuth3) #case1201 data
library(ISLR) #Hitters data

Project name:

Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Data

There are two datasets for this lab.

Part I

The dataset for Part I contains the SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 states in the U.S. The data is based on test takers for the 1982 exam. The following variables are in the dataset:

  • SAT: average total SAT score
  • State: U.S. State
  • Takers: percentage of high school seniors who took exam
  • Income: median income of families of test-takers ($ hundreds)
  • Years: average number of years test-takers had formal education in social sciences, natural sciences, and humanities
  • Public: percentage of test-takers who attended public high schools
  • Expend: total state expenditure on high schools ($ hundreds per student)
  • Rank: median percentile rank of test-takers within their high school classes

Part II

The dataset for Part II contains the performance statistics and salaries of Major League Baseball players in the 1986 and 1987 seasons. The data is in the Hitters dataset in the ISLR package. Type ?Hitters in the console to see the variables names and their definitions.

Exercises

Part I

For the first part of the lab, you will return to the model selection activity you started in class using the SAT data. The data is in the case1201 data frame in the Sleuth3 package.

sat_scores <- case1201 %>%
  select(-State) 
  1. Manually perform backward selection using \(Adj. R^2\) as the selection criterion. To help you get started, the full model and the code for the first set of models to test are below. Show each step of the selection process. Display the coefficients and \(Adj. R^2\) of your final model.
full_model <- lm(SAT ~ ., data = sat_scores)
m1 <- lm(SAT ~ Income + Years + Public + Expend + Rank, data = sat_scores)
m2 <- lm(SAT ~ Takers + Years + Public + Expend + Rank, data = sat_scores)
m3 <- lm(SAT ~ Takers + Income + Public + Expend + Rank, data = sat_scores)
m4 <- lm(SAT ~ Takers + Income + Years + Expend + Rank, data = sat_scores)
m5 <- lm(SAT ~ Takers + Income + Years + Public + Rank, data = sat_scores)
m6 <- lm(SAT ~ Takers + Income + Years + Public + Expend, data = sat_scores)
  1. What is the best 5-variable model? Display the model output.

  2. Use the regsubsets function to perform backward selection. What is the final model when \(Adj. R^2\) is the selection criterion? Display the coefficients and the \(Adj. R^2\) of the final model. This should be the same result you got in Exercise 1.

  3. What is the final model when \(BIC\) is the selection criterion? Display the coefficients and the \(BIC\) of the final model.

  4. Compare the final models selected by \(Adj. R^2\) and \(BIC\).
    • Do the models have the same number of predictors? Briefly explain.
    • Are the same predictor variables in each model? Briefly explain.
  5. Consider the comparisons made in the previous exercise. Are these differences what you would expect given the selection criteria used? Briefly explain.

Part II

The data for this part of the lab is the Hitters dataset in the ISLR package. Your goal is to fit a regression model that uses the performance statistics of baseball players to predictor their salary. There are 19 potential predictor variables, so you will use the regsubsets function to conduct forward selection to choose a final model.

  1. Read through the data dictionary for the Hitters dataset. You can access it by typing ?Hitters in the console. What is the difference between the variables HmRun and CHmRun?

  2. Some observations have missing values for Salary. Filter the data, so only observations that have values for Salary are included. You will use this filtered data for the remainder of the lab.

  3. Fill in the code below to conduct forward selection and save the results in an object called sel_summary (selection summary).

The nvmax option indicates the maximum-sized variable subsets to consider in the model selection.

regfit_forward <- regsubsets(_______, ________, method="forward", nvmax = 19)
sel_summary <- summary(_______)
  1. The object sel_summary contains the summary statistics for the best fit model containing \(k\) predictors, where \(k = 1, \ldots, 19\). The object sel_summary is a list object, so it is cumbersome to extract the relevant summary statistics. Therefore, you can create a data frame called summary_stats such that each row represents the best fit model with \(k\) predictors and each column is a summary statistic. For example, the second row contains the summary statistics of the best fit model that contains 2 variables.

Fill in the code below to create the data frame summary_stats that includes the \(BIC\), \(R^2\), \(Adj. R^2\), and residual sum of squares (RSS) for each model in sel_summary. The data frame summary_stats will also include the column np, the number of predictors in the model represented on each row.

summary_stats <- data.frame(bic = sel_summary$bic, 
                       adjr2 = _______, 
                       rsq = _______,
                       rss = _______) %>%
  mutate(np = row_number()) #number of variables

See the ggplot2 documentation for code to add a vertical line.

  1. Use the data in the summary_stats data frame to plot \(BIC\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the overall final model you would select based on \(BIC\). Be sure your plot has clear and informative title and axes labels.
    • How does \(BIC\) change as the number of predictors increases?
    • How many predictors are in the final model selected based on \(BIC\)?

You can fill in the code below with either max or min to find the number of predictors in the final model selected based on \(BIC\).

np_bic <- summary_stats %>%
  filter(bic == _____(bic)) %>%
  select(np) %>% 
  pull()
  1. Use the data in the summary_stats data frame to plot \(Adj. R^2\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the final model you would select based on \(Adj. R^2\). Be sure your plot has clear and informative title and axes labels.
    • How does \(Adj. R^2\) change as the number of predictors increases?
    • How many predictors are in the final model selected based on \(Adj. R^2\)?
  2. Use the data in the summary_stats data frame to plot \(R^2\) versus the number of predictors. Include a vertical line on your plot that shows the number of predictors for the final model selected based on \(R^2\). Be sure your plot has clear and informative title and axes labels.
    • How does \(R^2\) change as the number of predictors increases?
    • How many predictors are in the final model selected based on \(R^2\)?
  3. Should \(R^2\) be used as a model selection criterion? Briefly explain why or why not using your answers to Exercises 11 - 13.

  4. Choose a final model to predict a baseball player’s Salary from his performance statistics. Display the variables, their coefficients, and the summary statistics from the summary_stats data frame for this model.

  5. Briefly explain why you chose the model in the previous exercise.
    • Which model selection criteria did you use (\(BIC\), \(Adj. R^2\), \(R^2\))? Why?
    • What other factors did you consider besides the value of the model selection criteria?

Acknowledgements

Part II of this lab was inspired by Lab 6.5 in An Introduction to Statistical Learning and Variable Selection in Regression.