library(tidyverse)
library(tidymodels)
dat <- read.csv("batting.csv")
Exam 1 Solutions
STA 211 Spring 2023 (Jiang)
As noted in the instructions and course syllabus, any evidence of academic dishonesty will result in a failing grade for the course.
As noted in the instructions, failure to attach a .qmd file that matches what is in the exam repository will result in an automatic 10 point penalty.
Models, code, and output provided below for reference:
m3.1.1 <- lm(salary ~ G, data = dat)
m3.1.2 <- lm(salary ~ G + AB, data = dat)
tidy(m3.1.1)
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 1746785.   169652.     10.3  9.42e-24
2 G              2820.      299.      9.44 2.38e-20
tidy(m3.1.2)
# A tibble: 3 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 2055549.   215289.     9.55  8.96e-21
2 G              -490.     1457.    -0.337 7.36e- 1
3 AB              880.      379.     2.32  2.04e- 2
Exercise 3.1
The expected salary for a player that plays in zero games is $1,746,785.20. Although it may make sense to interpret quantities regarding players that have zero games played (this is certainly possible and interpretable, e.g., for rookies before the start of a season), doing so here would be extrapolating outside of the range of games for our dataset (which specifically looks at players with at least twenty recorded games).
Exercise 3.2
For each additional game a player has played, we expect a $490.39 decrease in salary, holding the number of at-bats constant. This is not what most people would expect, since intuitively more games played should go with a higher salary, which is what we see in the model with number of games as the only predictor (there, for each additional game a player has played, we expect a $2,820.37 increase in salary).
This difference is because we are adjusting for at-bats in the second model. Given any value of at-bats, more games are actually associated with lower salaries. This might make more sense, because these players have fewer at-bats per game, which implies that they have less game time (players with higher at-bats per game might be sent out more because they’re better players, for instance).
Exercise 3.3
We are conducting a hypothesis test at the \(\alpha = 0.05\) significance level for the null hypothesis \(H_0: \beta_{G} = 0\) (that there is no linear relationship between games and salary while controlling for number of at bats) vs. the alternative \(H_1: \beta_{G} \neq 0\) (there is such a relationship).
Our test statistic is -0.337, which follows a t distribution with 1050 degrees of freedom under \(H_0\). This corresponds to a p-value of 0.736, which is above our significance level. Hence, we fail to reject the null hypothesis. There is insufficient evidence to suggest a linear relationship between number of games played and salary, adjusting for number of at bats.
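As a check on the reasoning above, the two-sided p-value can be recomputed from the reported test statistic and degrees of freedom. This is a minimal sketch in base R; the variable names are illustrative and not part of the exam code, and the inputs are the rounded values from the output, so the result matches only up to rounding.

```r
# Values copied from the tidy(m3.1.2) output and the solution text
t_stat   <- -0.337   # test statistic for the G coefficient
df_resid <- 1050     # residual degrees of freedom

# Two-sided p-value: probability of a t statistic at least this extreme under H0
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
p_value

# Decision at the 0.05 significance level (TRUE would mean "reject H0")
reject_null <- p_value < 0.05
reject_null
```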
Models, code, and output provided below for reference:
m4.1.1 <- lm(salary ~ position + AVG, data = dat)
m4.1.2 <- lm(salary ~ position + AVG + position * AVG, data = dat)
anova(m4.1.1, m4.1.2)
Analysis of Variance Table

Model 1: salary ~ position + AVG
Model 2: salary ~ position + AVG + position * AVG
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)
1   1049 1.3586e+16
2   1047 1.3311e+16  2 2.7523e+14 10.824 2.223e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We are conducting a hypothesis test at the \(\alpha = 0.05\) significance level for the null hypothesis \(H_0: \beta_{I(Other) \times AVG} = \beta_{I(Pitcher) \times AVG} = 0\) (that all interaction terms are zero, which is to say that the relationship between batting average and salary does not depend on the player’s position) vs. the alternative \(H_1:\) at least one of the interaction terms is non-zero (that the relationship between batting average and salary does depend on the player’s position).
Our test statistic is 10.824, which follows an F distribution with 2 numerator degrees of freedom and 1047 denominator degrees of freedom under \(H_0\). This corresponds to a p-value <0.001, which is below our significance level. Hence, we reject the null hypothesis. There is sufficient evidence to suggest that the relationship between batting average and salary does depend on the player’s position.
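The F statistic in the ANOVA table can be rebuilt by hand from the printed sums of squares and degrees of freedom, which makes the nested-model comparison explicit. The sketch below uses base R with illustrative variable names (not from the exam code) and the rounded values shown in the table, so it agrees with the printed output only up to rounding.

```r
# Values copied from the anova(m4.1.1, m4.1.2) table
sum_sq_diff <- 2.7523e14   # drop in RSS from adding the interaction terms
rss_full    <- 1.3311e16   # RSS of the model with interactions
df_reduced  <- 1049        # residual df, model without interactions
df_full     <- 1047        # residual df, model with interactions

# Numerator df = number of interaction coefficients being tested
num_df <- df_reduced - df_full

# F = (extra sum of squares per tested coefficient) / (residual mean square of full model)
f_stat <- (sum_sq_diff / num_df) / (rss_full / df_full)
f_stat

# Upper-tail probability under the F(num_df, df_full) distribution
p_value <- pf(f_stat, df1 = num_df, df2 = df_full, lower.tail = FALSE)
p_value
```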
Models, code, and output provided below for reference:
m5.1 <- lm(salary ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
tidy(m5.1)
# A tibble: 12 x 5
   term            estimate std.error statistic  p.value
   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)     -754508.  3651987.    -0.207 8.36e- 1
 2 G                 -2106.     1315.    -1.60  1.10e- 1
 3 AB                 1062.      365.     2.91  3.70e- 3
 4 AVG             4865170.  2535770.     1.92  5.53e- 2
 5 positionOther    681652.   325285.     2.10  3.64e- 2
 6 positionPitcher 3224946.   453521.     7.11  2.14e-12
 7 allstar         3387125.   250922.    13.5   2.20e-38
 8 weight            12627.     5724.     2.21  2.76e- 2
 9 height             7483.    52025.     0.144 8.86e- 1
10 batsL           -309373.   346471.    -0.893 3.72e- 1
11 batsR           -350636.   326817.    -1.07  2.84e- 1
12 ageDebut        -121633.    51605.    -2.36  1.86e- 2
m5.1_aug <- augment(m5.1)
ggplot(m5.1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "darkred") +
  labs(x = "Fitted (predicted) value",
       y = "Residual",
       title = "Clear violation of linearity and constant variance assumptions") +
  theme_bw()
ggplot(m5.1_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_bw() +
  labs(x = "Theoretical quantiles",
       y = "Sample quantiles",
       title = "Clear violation of normality assumption")
Exercise 5.1
-$309,373.4 - (-$350,636.3) = $41,262.9.
Adjusting for the other predictors in the model, left-handed batters are expected to make approx. $41,263 more than right-handed batters.
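The comparison works because both `batsL` and `batsR` are offsets from the same reference batting level (whichever level is absent from the output), so the reference cancels when one is subtracted from the other. A small sketch of that arithmetic in base R, with illustrative variable names and the coefficient values used in the solution:

```r
# Coefficients from tidy(m5.1), each measured relative to the same reference level
coef_batsL <- -309373.4   # left-handed vs. reference
coef_batsR <- -350636.3   # right-handed vs. reference

# (reference + batsL) - (reference + batsR) = batsL - batsR
diff_left_vs_right <- coef_batsL - coef_batsR
diff_left_vs_right
```

A positive difference means left-handed batters are expected to earn more than right-handed batters, holding the other predictors fixed.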
Exercise 5.2
Independence is assumed to be satisfied for this question (though it is likely untrue in the real world due to the way teams/contracts are structured).
Linearity is not satisfied, since we do not have symmetric observations about the horizontal axis (way more spread above it than below, and we even see that the residuals are not “centered” along this axis). Constant variance is not satisfied, since the variance of the residuals clearly gets larger as the fitted values get larger. Normality is not satisfied, as evidenced by the clear departure from the diagonal line in the Q-Q plot for both low and high quantiles (the residuals themselves are very right-skewed).
Models, code, and output provided below for reference:
m6.1 <- lm(log2(salary) ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
tidy(m6.1)
# A tibble: 12 x 5
   term              estimate std.error statistic  p.value
   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)     19.1        1.50       12.7    1.68e-34
 2 G                0.00113    0.000541    2.10   3.62e- 2
 3 AB               0.0000910  0.000150    0.606  5.44e- 1
 4 AVG              2.20       1.04        2.11   3.53e- 2
 5 positionOther    0.161      0.134       1.20   2.30e- 1
 6 positionPitcher  1.49       0.187       8.01   3.14e-15
 7 allstar          1.24       0.103      12.0    2.41e-31
 8 weight           0.00459    0.00235     1.95   5.12e- 2
 9 height           0.00110    0.0214      0.0515 9.59e- 1
10 batsL           -0.145      0.142      -1.02   3.10e- 1
11 batsR           -0.0915     0.134      -0.681  4.96e- 1
12 ageDebut        -0.0594     0.0212     -2.80   5.23e- 3
2^1.4933
[1] 2.815322
library(glmnet)
set.seed(4380)
y <- log2(dat$salary)
x <- model.matrix(log2(salary) ~ G + AB + AVG + position + allstar + weight + height + bats + ageDebut, data = dat)
m6.1_lasso_cv <- cv.glmnet(x, y, alpha = 1)
best_lambda <- m6.1_lasso_cv$lambda.min
best_lambda
[1] 0.003812728
m6.1_lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)
m6.1_lasso_best$beta
12 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)      .
G                1.144055e-03
AB               8.769415e-05
AVG              1.947328e+00
positionOther    1.218987e-01
positionPitcher  1.419984e+00
allstar          1.243190e+00
weight           4.318390e-03
height           3.658793e-04
batsL           -9.384155e-02
batsR           -4.614191e-02
ageDebut        -5.971399e-02
Exercise 6.1
Adjusting for number of games and at-bats, batting average, all-star status, weight, height, batting hand, and age at debut, we expect a pitcher to have a \(\log_2\) salary that is 1.493 higher than that of a catcher.
Adjusting for number of games and at-bats, batting average, all-star status, weight, height, batting hand, and age at debut, we predict that a pitcher has 2.815 times the salary of a catcher.
Exercise 6.2
The \(\lambda\) chosen was 0.003813. All variables were included in the LASSO model.
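One way to confirm that no predictor was dropped is to check the LASSO coefficients for exact zeros. The sketch below copies the printed coefficients into a named vector (the name `lasso_beta` is illustrative, not from the exam code) rather than refitting the model. The `(Intercept)` row is left out: it prints as `.` because the constant column produced by `model.matrix()` is redundant with the intercept that `glmnet()` fits on its own by default.

```r
# LASSO coefficients copied from the m6.1_lasso_best$beta output
lasso_beta <- c(
  G = 1.144055e-03, AB = 8.769415e-05, AVG = 1.947328e+00,
  positionOther = 1.218987e-01, positionPitcher = 1.419984e+00,
  allstar = 1.243190e+00, weight = 4.318390e-03, height = 3.658793e-04,
  batsL = -9.384155e-02, batsR = -4.614191e-02, ageDebut = -5.971399e-02
)

# Terms that the penalty shrank exactly to zero (i.e., excluded from the model)
dropped_terms <- names(lasso_beta)[lasso_beta == 0]
length(dropped_terms)

# TRUE when every predictor term is retained
all_retained <- all(lasso_beta != 0)
all_retained
```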
Exercise 7.1
The authors performed a hypothesis test on an interaction term between time (year) and age group. We know this because they specifically are comparing the increase over time itself (that is, the slope corresponding to year) between the two groups. Example null and alternative hypotheses are \(H_0: \beta_{year \times group} = 0\) vs. \(H_1: \beta_{year \times group} \neq 0\).
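To illustrate what such a test looks like in practice, here is a minimal sketch on simulated data. Everything in it is hypothetical: the data frame `sim`, the variable names `year`, `group`, and `outcome`, and the slopes used to generate the data are made up for illustration and are not taken from the study in question.

```r
set.seed(1)
n <- 200

# Hypothetical data: ten years of observations for two age groups
sim <- data.frame(
  year  = rep(2000:2009, times = n / 10),
  group = factor(rep(c("younger", "older"), each = n / 2),
                 levels = c("younger", "older"))
)

# Simulated outcome: the yearly increase is 0.3 larger in the "older" group
years_since_start <- sim$year - 2000
sim$outcome <- 10 + 0.5 * years_since_start +
  0.3 * years_since_start * (sim$group == "older") + rnorm(n)

# year * group expands to year + group + year:group
fit <- lm(outcome ~ year * group, data = sim)

# The year:groupolder row tests H0: the slope over time is the same in both groups
interaction_row <- summary(fit)$coefficients["year:groupolder", ]
interaction_row
```

The estimate in that row is the difference in slopes between the two groups, and its p-value corresponds to the null and alternative hypotheses written above.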
Exercise 7.2
No. Their statement affirms the null hypothesis by stating it is true (“the increase…did not differ”) as opposed to a correct statement along the lines of “there was not enough evidence to suggest a difference,” etc.
This pattern is due to the league-contracted minimum starting salary of approx. $500k. All of these players have the same salary, but have different predictions from the model based on their characteristics. The model has no way of “knowing” that they are receiving the MLB minimum contract, as this information wasn’t included (e.g., number of years spent in the league).
