To receive credit, you must submit to Gradescope a knitted file rendered from a Quarto Markdown file. Be sure to “associate” questions appropriately on Gradescope. As a reminder, late work is not accepted outside of the 24-hour grace period for homework assignments. The Quarto template for this assignment may be found in the repository at the following link: https://classroom.github.com/a/GFRamF7z

We will once again be using the pre-COVID baseball statistics from last week's homework. As a reminder, these data contain salary information (salary), the number of games the player has played in their career (G), batting average (AVG), whether the player is an All-Star (allstar), whether they bat left-handed, right-handed, or with both hands (bats), and the age at which they debuted in the Major Leagues (ageDebut), among other variables.

Important: Please continue to make regular commits. Note that to avoid having code “cut off,” you may insert line breaks as needed; good places to include them are after plus signs when specifying model predictors (+) or after commas when separating model arguments/options (,). We will be taking off points if your code runs off the page!

Important: Please suppress warnings and messages in your R code chunks by including the options message = F, warning = F in each chunk header. For instance: ```{r chunk-name, message = F, warning = F}
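As an illustration of the two formatting notes above, a complete chunk might look like the following sketch. It deliberately uses R's built-in mtcars data rather than the assignment data, and the chunk name fit-example and the object name fit are placeholders, not names required by the assignment.

````markdown
```{r fit-example, message = F, warning = F}
# Illustrative only: built-in mtcars data, not the baseball data.
# Line breaks fall after plus signs and after commas so that
# the code does not run off the page when the file is rendered.
fit <- lm(mpg ~ wt +
            hp +
            qsec,
          data = mtcars)
summary(fit)
```
````

The same pattern should carry over to your own models: break a long formula after a + and break a long function call after a comma.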

  1. Fit a linear model that only uses batting hand to predict a player's salary. Create a residual plot and explain why the residual plot looks as it does.
  2. Now fit a linear model that uses batting average, the number of games the player has appeared in, All-Star status, batting hand, and age at debut to predict a player's salary. Comprehensively evaluate the assumptions needed for the linear model.
  3. For which observation did the model do the worst in terms of prediction, as measured by the largest absolute residual? What are the characteristics of this player?
  4. Create a variable which corresponds to the ratio of the predicted value to the observed value of the response variable. For which observation did the model do the worst in terms of prediction, as measured by this ratio? In answering this question, keep in mind that this ratio may be below 1 as well; a ratio of 0.25 is equivalent to 4 on the multiplicative scale (in both cases, we have one value being 4 times the other).
  5. In the residual plot for your model from Exercise 2, there is a curious linear (diagonal) pattern in the bottom-left portion of the plot. Explain specifically what is causing this pattern (cite any sources used).
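A note on the multiplicative scale mentioned in Exercise 4: one way (not the only one) to put ratios below and above 1 on equal footing is to compare absolute log-ratios. The notation below is introduced here for illustration only, and it assumes that both the observed value and the predicted value are positive.

```latex
r_i = \frac{\hat{y}_i}{y_i},
\qquad
d_i = \left| \log r_i \right|
    = \left| \log \hat{y}_i - \log y_i \right|
% Example: r_i = 0.25 and r_i = 4 give the same distance,
% since |\log 0.25| = |-\log 4| = \log 4.
```

Under this measure, a prediction that is one quarter of the observed value and a prediction that is four times the observed value count as equally far off.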