Exam 1 Solutions

STA 211 Spring 2023 (Jiang)

library(tidyverse)
library(tidymodels)
dat <- read.csv("batting.csv")

Exercise 1 - 0 points

Instructions, data description, and Duke Community Standard

As noted in the instructions and course syllabus, any evidence of academic dishonesty will result in a failing grade for the course.

Exercise 2 - 0 points

Uploading of markdown document corresponding to exam

As noted in the instructions, failure to attach a .qmd file that matches what is in the exam repository will result in an automatic 10 point penalty.

Exercise 3 - 15 points

Consider two linear models: the first with salary as the response variable and total number of games played as the only predictor variable, and the second with salary as the response variable and both the total number of games played and the number of at bats as the two predictor variables.

Exercise 3.1

Interpret the intercept in the first model. Does this quantity make sense to interpret? Explain.

Exercise 3.2

Consider the slope coefficient corresponding to the total number of games played in the second model. Is this what most people might expect? Explain what is happening here. In your answer, specifically interpret the estimated slope parameters corresponding to the number of games played for both of your models in context of the data.

Exercise 3.3

Conduct a formal hypothesis test at the $\alpha = 0.05$ level for the slope parameter corresponding to the total number of games played in the second of your two models.

Models, code, and output provided below for reference:

m3.1.1 <- lm(salary ~ G, data = dat)
m3.1.2 <- lm(salary ~ G + AB, data = dat)

tidy(m3.1.1)

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 1746785.   169652.     10.3  9.42e-24
2 G              2820.      299.      9.44 2.38e-20

tidy(m3.1.2)

# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 2055549.   215289.     9.55  8.96e-21
2 G              -490.     1457.    -0.337 7.36e- 1
3 AB              880.      379.     2.32  2.04e- 2

Exercise 3.1

The expected salary for a player who plays in zero games is $1,746,785.20. Although interpreting quantities for players with zero games played can make sense (e.g., for rookies before the start of a season), doing so here would extrapolate outside the range of games in our dataset (which specifically looks at players with at least twenty recorded games).

Exercise 3.2

For each additional game a player has played, we expect a $490.39 decrease in salary, holding the number of at-bats constant. This is not what most people would expect: intuitively, more games played should go with a higher salary, which is what we see in the model with number of games as the only predictor (there, for each additional game a player has played, we expect a $2,820.37 increase in salary).

This difference arises because we are adjusting for at-bats in the second model. Among players with the same number of at-bats, more games are estimated to be associated with lower salaries (though this estimate is not statistically distinguishable from zero; see Exercise 3.3). The direction is plausible, because such players have fewer at-bats per game, which implies that they have less game time (players with more at-bats per game might be sent out more because they’re better players, for instance).
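The sign flip can be reproduced with simulated data. The sketch below does not use the exam data: the variable names mirror the exam's, but every number in the data-generating process is an assumption, chosen so that at-bats rise with games and, at a fixed number of at-bats, extra games lower salary.

```r
# Simulated (hypothetical) data, not batting.csv
set.seed(1)
n  <- 1000
G  <- runif(n, min = 20, max = 162)          # games played
AB <- 3 * G + rnorm(n, sd = 40)              # at-bats track games closely
salary <- 2e6 + 900 * AB - 500 * G + rnorm(n, sd = 2e4)
sim <- data.frame(salary, G, AB)

m_g    <- lm(salary ~ G, data = sim)         # games only
m_g_ab <- lm(salary ~ G + AB, data = sim)    # games, adjusting for at-bats

slope_marginal <- unname(coef(m_g)["G"])     # G also absorbs the effect of AB
slope_adjusted <- unname(coef(m_g_ab)["G"])  # effect of G with AB held fixed
c(marginal = slope_marginal, adjusted = slope_adjusted)
```

With games as the only predictor, its slope picks up the effect of at-bats and comes out positive; once at-bats enter the model, the slope for games turns negative.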

Exercise 3.3

We are conducting a hypothesis test at the $\alpha = 0.05$ significance level for the null hypothesis $H_0: \beta_{G} = 0$ (that there is no linear relationship between games and salary while controlling for number of at bats) vs. the alternative $H_1: \beta_{G} \neq 0$ (there is such a relationship).

Our test statistic is -0.337, which follows a t distribution with 1050 degrees of freedom under $H_0$. This corresponds to a p-value of 0.736, which is above our significance level. Hence, we fail to reject the null hypothesis. There is insufficient evidence to suggest a linear relationship between number of games played and salary, adjusting for number of at bats.
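As a check, the p-value can be recomputed from the test statistic and its degrees of freedom. The sketch below hard-codes the rounded values printed in the tidy(m3.1.2) output, so the results are approximate; the confidence interval is an extra not required by the question.

```r
# Values copied (rounded) from the tidy(m3.1.2) output above
t_stat   <- -0.337
estimate <- -490
std_err  <- 1457
df       <- 1050   # residual degrees of freedom stated in the solution

# Two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value

# 95% confidence interval for the slope of G; it should cover zero
ci <- estimate + c(-1, 1) * qt(0.975, df = df) * std_err
ci
```

The recomputed p-value should agree with the 7.36e-1 shown in the output, and the interval covering zero matches the decision not to reject $H_0$.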

Exercise 4 - 10 points

Consider a linear model with salary as the response and position and batting average as the only two potential predictors of interest. Is there sufficient statistical evidence that the relationship between batting average and salary depends on the position of the player?

Answer this question using only a single formal hypothesis test at the $\alpha = 0.05$ level.

Models, code, and output provided below for reference:

m4.1.1 <- lm(salary ~ position + AVG, data = dat)
m4.1.2 <- lm(salary ~ position + AVG + position * AVG, data = dat)

anova(m4.1.1, m4.1.2)

Analysis of Variance Table

Model 1: salary ~ position + AVG
Model 2: salary ~ position + AVG + position * AVG
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1049 1.3586e+16                                   
2   1047 1.3311e+16  2 2.7523e+14 10.824 2.223e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are conducting a hypothesis test at the $\alpha = 0.05$ significance level for the null hypothesis $H_0: \beta_{I(Other) \times AVG} = \beta_{I(Pitcher) \times AVG} = 0$ (that all interaction terms are zero, which is to say that the relationship between batting average and salary does not depend on the player’s position) vs. the alternative $H_1:$ at least one of the interaction terms is non-zero (that the relationship between batting average and salary does depend on the player’s position).

Our test statistic is 10.824, which follows an F distribution with 2 numerator degrees of freedom and 1047 denominator degrees of freedom under $H_0$. This corresponds to a p-value <0.001, which is below our significance level. Hence, we reject the null hypothesis. There is sufficient evidence to suggest that the relationship between batting average and salary does depend on the player’s position.
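The F statistic in the table can be rebuilt by hand from the sums of squares, which makes clear what the nested-model test compares. This sketch hard-codes the rounded values from the anova() output above, so it matches the table only up to rounding.

```r
# Values copied from the anova(m4.1.1, m4.1.2) table above
rss_full <- 1.3311e16   # residual sum of squares, interaction model
df_full  <- 1047        # residual degrees of freedom, interaction model
ss_extra <- 2.7523e14   # drop in RSS from adding the interaction terms
df_extra <- 2           # number of interaction terms being tested

# F = (extra sum of squares per added term) / (residual mean square)
f_stat  <- (ss_extra / df_extra) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = df_extra, df2 = df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```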

Exercise 5 - 15 points

Consider another linear model with salary as the response variable and all variables except for the player ID as predictors.

Exercise 5.1

What is the expected difference in salary between someone who bats right vs. someone who bats left, adjusting for the predictors used in the model?

Exercise 5.2

Formally assess whether the linear model assumptions in this model are satisfied. You may assume independence is satisfied. Attach any supporting files needed (e.g., plots, etc.) in the space provided. You are expected to provide professional, high-quality plots that adhere to good visualization practices.

Models, code, and output provided below for reference:

m5.1 <- lm(salary ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
tidy(m5.1)

# A tibble: 12 x 5
   term            estimate std.error statistic  p.value
   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)     -754508.  3651987.    -0.207 8.36e- 1
 2 G                 -2106.     1315.    -1.60  1.10e- 1
 3 AB                 1062.      365.     2.91  3.70e- 3
 4 AVG             4865170.  2535770.     1.92  5.53e- 2
 5 positionOther    681652.   325285.     2.10  3.64e- 2
 6 positionPitcher 3224946.   453521.     7.11  2.14e-12
 7 allstar         3387125.   250922.    13.5   2.20e-38
 8 weight            12627.     5724.     2.21  2.76e- 2
 9 height             7483.    52025.     0.144 8.86e- 1
10 batsL           -309373.   346471.    -0.893 3.72e- 1
11 batsR           -350636.   326817.    -1.07  2.84e- 1
12 ageDebut        -121633.    51605.    -2.36  1.86e- 2

m5.1_aug <- augment(m5.1)

ggplot(m5.1_aug, aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0, color = "darkred") + 
  labs(x = "Fitted (predicted) value", 
       y = "Residual",
       title = "Clear violation of linearity and constant variance assumptions") + 
  theme_bw()

ggplot(m5.1_aug, aes(sample = .resid)) +
  stat_qq() + 
  stat_qq_line() + 
  theme_bw() + 
  labs(x = "Theoretical quantiles", 
       y = "Sample quantiles",
       title = "Clear violation of normality assumption")

Exercise 5.1

-$309,373.4 - (-$350,636.3) = $41,262.9.

Adjusting for the other predictors in the model, left-handed batters are expected to make approx. $41,263 more than right-handed batters.
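Both batting coefficients are measured against the same baseline category, so the baseline cancels when one is subtracted from the other. A minimal sketch of that arithmetic, using the rounded estimates printed in the tidy(m5.1) output:

```r
# Coefficients copied (rounded) from the tidy(m5.1) output above, in dollars.
# Each is a difference from the baseline batting category.
coefs <- c(batsL = -309373, batsR = -350636)

# (left - baseline) - (right - baseline) = left - right
diff_left_minus_right <- unname(coefs["batsL"] - coefs["batsR"])
diff_left_minus_right
```

A positive value means left-handed batters are expected to earn more than right-handed batters, holding the other predictors fixed.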

Exercise 5.2

Independence is assumed satisfied in this question from the assumptions (though likely untrue in the real world due to the way teams/contracts are structured).

Linearity is not satisfied, since the observations are not symmetric about the horizontal axis (there is far more spread above it than below, and the residuals are not even “centered” on this axis). Constant variance is not satisfied, since the variance of the residuals clearly gets larger as the fitted values get larger. Normality is not satisfied, as evidenced by the clear departure from the diagonal line in the Q-Q plot for both low and high quantiles (the residuals themselves are very right-skewed).

Exercise 6 - 10 points

Fit a linear model with log (base 2) transformed salary as the outcome variable and all variables except for the player ID as predictors.

Exercise 6.1

Interpret the estimated slope term corresponding to being a pitcher in context of the data on both the transformed and non-transformed scales.

Exercise 6.2

Set a random seed equal to the last four digits of your Duke ID. Using your model as a starting point, use LASSO to select variables for a model with $\lambda$ chosen using 10-fold cross validation minimizing MSE.

What was the $\lambda$ that was used? What variables were chosen?

Models, code, and output provided below for reference:

m6.1 <- lm(log2(salary) ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
tidy(m6.1)

# A tibble: 12 x 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)     19.1        1.50       12.7    1.68e-34
 2 G                0.00113    0.000541    2.10   3.62e- 2
 3 AB               0.0000910  0.000150    0.606  5.44e- 1
 4 AVG              2.20       1.04        2.11   3.53e- 2
 5 positionOther    0.161      0.134       1.20   2.30e- 1
 6 positionPitcher  1.49       0.187       8.01   3.14e-15
 7 allstar          1.24       0.103      12.0    2.41e-31
 8 weight           0.00459    0.00235     1.95   5.12e- 2
 9 height           0.00110    0.0214      0.0515 9.59e- 1
10 batsL           -0.145      0.142      -1.02   3.10e- 1
11 batsR           -0.0915     0.134      -0.681  4.96e- 1
12 ageDebut        -0.0594     0.0212     -2.80   5.23e- 3

2^1.4933

[1] 2.815322

library(glmnet)

set.seed(4380)

y <- log2(dat$salary)
x <- model.matrix(log2(salary) ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
m6.1_lasso_cv <- cv.glmnet(x, y, alpha = 1)
best_lambda <- m6.1_lasso_cv$lambda.min
best_lambda

[1] 0.003812728

m6.1_lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)
m6.1_lasso_best$beta

12 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)      .           
G                1.144055e-03
AB               8.769415e-05
AVG              1.947328e+00
positionOther    1.218987e-01
positionPitcher  1.419984e+00
allstar          1.243190e+00
weight           4.318390e-03
height           3.658793e-04
batsL           -9.384155e-02
batsR           -4.614191e-02
ageDebut        -5.971399e-02

Exercise 6.1

Adjusting for number of games and at-bats, batting average, all-star status, weight, height, batting hand, and age at debut, we expect a pitcher's $\log_2$ salary to be 1.493 higher than a catcher's.

Adjusting for number of games and at-bats, batting average, all-star status, weight, height, batting hand, and age at debut, we predict that a pitcher has 2.815 times the salary of a catcher.
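The two interpretations are linked by undoing the log: a difference of b on the $\log_2$ scale is a ratio of 2^b on the dollar scale. A short sketch using the coefficient value 1.4933 from the computation above; the percentage form is an additional way to state the same ratio.

```r
b_pitcher <- 1.4933               # log2-scale coefficient for positionPitcher

ratio      <- 2^b_pitcher         # multiplicative difference in salary
pct_higher <- (ratio - 1) * 100   # the same difference as a percentage
c(ratio = ratio, pct_higher = pct_higher)
```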

Exercise 6.2

The $\lambda$ chosen was 0.003813. All variables were included in the LASSO model.
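LASSO "selects" a variable when its coefficient is not shrunk exactly to zero, so the chosen variables can be read off by filtering for non-zero entries. The sketch below hard-codes the coefficients printed above rather than refitting the model. The `(Intercept)` row shown as `.` is the constant column added by model.matrix(); glmnet fits its own intercept separately, so that row is not a dropped predictor and is left out here.

```r
# Coefficients copied from the m6.1_lasso_best$beta output above
beta <- c(G = 1.144055e-03, AB = 8.769415e-05, AVG = 1.947328e+00,
          positionOther = 1.218987e-01, positionPitcher = 1.419984e+00,
          allstar = 1.243190e+00, weight = 4.318390e-03,
          height = 3.658793e-04, batsL = -9.384155e-02,
          batsR = -4.614191e-02, ageDebut = -5.971399e-02)

selected <- names(beta)[beta != 0]   # terms with non-zero coefficients
selected
length(selected) == length(beta)     # TRUE when no term was dropped
```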

Exercise 7 - 10 points

This question is based on Brown, Sarvet, and Shmulewitz’s 2017 JAMA paper that explored trends in marijuana use among women from 2002 to 2014 based on the percentage of women who indicated that they had used marijuana in the past month.

A figure from their paper is reproduced below:

Suppose the authors concluded:

The increase over time in adjusted past-month marijuana use did not differ by age group (P = .73).

Exercise 7.1

What specific hypothesis test did they perform to arrive at this conclusion? Explain, specifically referencing the hypothesis test you mentioned and the visualization itself.

Exercise 7.2

Was their statement appropriate to make given their hypothesis test? Explain why or why not.

Exercise 7.1

The authors performed a hypothesis test on an interaction term between time (year) and age group. We know this because they specifically are comparing the increase over time itself (that is, the slope corresponding to year) between the two groups. Example null and alternative hypotheses are $H_0: \beta_{year \times group} = 0$ vs. $H_1: \beta_{year \times group} \neq 0$.

Exercise 7.2

No. Their statement affirms the null hypothesis by stating it is true (“the increase…did not differ”) as opposed to a correct statement along the lines of “there was not enough evidence to suggest a difference,” etc.

Bonus question

In your models using the baseball data, you may have noticed a perfectly straight diagonal line in your residual plots, likely going $\searrow$. (You might notice this for your log-transformed models as well, though I didn’t ask you to create residual plots for those models. Note that these types of patterns aren’t necessarily “bad things” per se.)

What is the explanation for this pattern?

This pattern is due to the league-contracted minimum starting salary of approx. $500k. All of these players have the same salary, but have different predictions from the model based on their characteristics. Because a residual is the observed salary minus the fitted value, players who share one observed salary fall on a straight line with slope -1 in the residuals-versus-fitted plot. The model has no way of “knowing” that they are receiving the MLB minimum contract, as this information wasn’t included (e.g., number of years spent in the league).
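The same pattern appears in any regression where many observations share one response value. The sketch below uses simulated data, not the exam data: `floor_value` is a hypothetical stand-in for the minimum salary, and the other numbers are arbitrary assumptions.

```r
# Simulated (hypothetical) data: a floor on the response creates the line
set.seed(2)
n <- 300
x <- runif(n, min = 0, max = 10)
floor_value <- 5                         # stands in for the minimum salary
y <- pmax(2 + 0.8 * x + rnorm(n, sd = 2), floor_value)

fit <- lm(y ~ x)
at_floor <- y == floor_value             # observations sitting at the floor

# For these points residual + fitted equals the floor, so
# residual = floor - fitted: a straight line with slope -1
line_check <- all.equal(
  unname(resid(fit)[at_floor] + fitted(fit)[at_floor]),
  rep(floor_value, sum(at_floor))
)
line_check

# Residuals vs. fitted, with the floor observations highlighted
plot(fitted(fit), resid(fit),
     col = ifelse(at_floor, "darkred", "grey40"),
     xlab = "Fitted (predicted) value", ylab = "Residual")
```

The highlighted points trace the downward diagonal, while the remaining points scatter as usual.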