Variable | Type | Definition |
---|---|---|
name | record ID | pitcher's name |
team86 | categorical | player's team at the end of 1986 |
league86 | categorical | player's league at the end of 1986 |
win86 | continuous | number of wins in 1986 |
loss86 | continuous | number of losses in 1986 |
era86 | continuous | earned run average in 1986 |
numgam86 | continuous | number of games in 1986 |
numin86 | continuous | number of innings pitched in 1986 |
numsav86 | continuous | number of saves in 1986 |
numyears | continuous | number of years in the major leagues |
totwin | continuous | number of wins during his career |
totloss | continuous | number of losses during his career |
totera | continuous | earned run average during his career |
totgames | continuous | number of games during his career |
totin | continuous | number of innings pitched during his career |
totsav | continuous | number of saves during his career |
sal87 | continuous | 1987 annual salary on opening day, in thousands of dollars |
divisn86 | categorical | player's division at the end of 1986 |
rank86 | continuous | team's position in final league standings in 1986 |
homattnd | continuous | attendance for home games in 1986 |
awyattnd | continuous | attendance for away games in 1986 |
The data set was derived by the Statistical Graphics Section of the American Statistical Association from three sources: 1) Sports Illustrated, April 20, 1987; 2) The 1987 Baseball Encyclopedia Update (New York: Collier Books); and 3) the Elias Sports Bureau. The data are distributed by StatLib, a service of the Department of Statistics at Carnegie Mellon University. Cells in the worksheet for which data are not available contain an "m" instead of a data value. I've removed six older players (Joe and Phil Niekro, Nolan Ryan, Bert Blyleven, Don Sutton and Steve Carlton) from the data set. Including these players, the only ones with 200 or more career wins, would complicate our analysis. Hence our analysis applies only to pitchers with fewer than 200 career wins.
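Because missing cells hold the letter "m" rather than a number, any program that reads the worksheet has to treat "m" as missing before doing arithmetic. The Python sketch below shows one way to do that. It is an illustration only (the project itself is carried out in SAS), and the comma-separated layout, the sample rows, and the helper names `parse_cell` and `load_rows` are assumptions made for this example, not part of the distributed data.

```python
import csv
import io

# Hypothetical excerpt in comma-separated form. The pitcher names and values
# are made up for illustration and are NOT taken from the real data set.
SAMPLE = """name,win86,era86,sal87
Pitcher A,12,3.45,450
Pitcher B,m,4.10,m
Pitcher C,7,m,120
"""

def parse_cell(text):
    """Return None for the missing-value code "m", otherwise a float."""
    text = text.strip()
    return None if text == "m" else float(text)

def load_rows(handle, numeric_columns):
    """Read rows as dicts, converting the named columns with parse_cell."""
    rows = []
    for row in csv.DictReader(handle):
        for column in numeric_columns:
            row[column] = parse_cell(row[column])
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE), ["win86", "era86", "sal87"])

# Keep only pitchers with a recorded salary before analyzing sal87.
with_salary = [row for row in rows if row["sal87"] is not None]
print(len(rows), len(with_salary))
```

Whatever software you use, check how many observations remain after missing values are excluded, since a regression typically drops any pitcher who is missing a value for a variable in the model.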
Your report should be about 5 or 6 typed pages in length (excluding figures and appendix material) and include:
Your project report is due in class on April 28th. Late reports will not be accepted.
To print plots and figures created in SAS, you first need to set your PRINTER environment variable so that printed output is directed to the printer of your choice, most likely the printer that serves the cluster in which you are working. To print in the teer lab, type "setenv PRINTER teerlp1" or "setenv PRINTER teerlp2" before typing "sas project &". If you are working in another cluster, replace teerlp1 with the appropriate printer name; a list of printer names and info on printing files can be found here. More information on printing, especially output containing text, can be found here.
To get started, type "sas project &" in one of the terminals open on your screen.
2) Explore relationships between sal87 and the other variables in the data set (except name and team86, the players' names and teams). Look at scatter plots of salary against the continuous covariates, and box plots or histograms of salary grouped by the categorical covariates. What do you see? Do any of the covariates show promise as predictors of salary?
3) Use linear regression to explore the relationship between salary and factors that may influence salary. While factors that determine salary are unknown to us, we can use the variables we have to correlate experience (numyears, numgam86, totgames, numin86, totin), previous year's performance (win86, loss86, era86, numsav86), career performance (totwin, totloss, totera, totsav) and team characteristics (league86, divisn86, rank86, homattnd, awyattnd) with salary.
Verify that the histogram of sal87 is highly right-skewed. Start by fitting a linear regression model for sal87 that includes (a) the promising covariates identified in "Question 2" and (b) those that you have strong prior beliefs should be included in the model as predictor variables. Plot residuals against fitted values for this regression; verify that this plot looks like the right panel of Figure 14-10 on page 463 of the book. Create a new response variable logsal87 = log(sal87). Repeat the previous regression using logsal87 as the response variable. Look at plots of residuals against all predictor variables to verify that each residual relationship is flat (some measures of experience, like years playing, may show a different relationship for early-, mid- and late-career players; look for patterns in the residual plots like those pictured in Figure 14-7(b) on page 462 of the book). Enter appropriate quadratic terms into the model, if necessary. Search for a parsimonious model by removing, one at a time, variables that (a) have little explanatory ability and (b) you had only weak prior reason to include in the model.
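To see why refitting on the log scale can change a residual plot, the following Python sketch fits a one-predictor least-squares line to raw and to logged values and compares the residuals. It is a toy illustration, not the project analysis: the data are made up so that salary grows by a constant percentage per year, the helper names `fit_line` and `residuals` are hypothetical, and the pattern in the real data (and in the book's figures) may differ, for example in showing non-constant spread rather than curvature.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed minus fitted values."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up data: salary grows by 50% per year of experience (NOT real salaries).
years = [1, 2, 3, 4, 5, 6]
salary = [100 * 1.5 ** t for t in years]

# Raw scale: a straight line cannot follow the curve, so the residuals are
# positive at both ends and negative in the middle.
raw_fit = fit_line(years, salary)
raw_resid = residuals(years, salary, *raw_fit)

# Log scale: log(salary) is exactly linear in years for this toy data,
# so the residuals are zero up to rounding error.
log_salary = [math.log(s) for s in salary]
log_fit = fit_line(years, log_salary)
log_resid = residuals(years, log_salary, *log_fit)

print([round(r, 1) for r in raw_resid])
print(max(abs(r) for r in log_resid))
```

A systematic pattern in the residuals on one scale that disappears on another is the kind of evidence the residual plots in this step are meant to reveal.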
In summarizing your results, remember that your response variable is the log of salary.
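As a reminder of what this means for interpretation, here is a sketch for a model with a single predictor $x$, assuming log is the natural logarithm (as the SAS log function is). The symbols $\beta_0$, $\beta_1$ and $\varepsilon$ are notation introduced here for illustration.

```latex
\begin{align*}
\log(\text{sal87}) &= \beta_0 + \beta_1 x + \varepsilon \\
\text{sal87} &= e^{\beta_0}\, e^{\beta_1 x}\, e^{\varepsilon} \\
\frac{\text{sal87 at } x + 1}{\text{sal87 at } x} &= e^{\beta_1}
  \quad \text{(holding } \varepsilon \text{ fixed)}
\end{align*}
```

Under this model a one-unit increase in $x$ multiplies salary by $e^{\beta_1}$ rather than adding a fixed dollar amount. For example, a coefficient of $0.05$ corresponds to a factor of $e^{0.05} \approx 1.051$, or about a 5% higher salary per unit of $x$, with the other predictors held fixed in a multiple regression.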