Statistics 110B Project -- Spring 1998

Data and Problem

We will analyze the 1987 salaries of Major League Baseball pitchers. The purpose of our analysis will be to explore the relationship between salary and factors that may influence salary. We seek to answer, at least in part, the following questions: 'what variables are related to salary?', 'what is a good statistical model for salary?', and 'what are the limitations of our analysis?' A brief description of the variables that constitute the data set follows.
Variable   Type         Definition
name       record ID    pitcher's name
team86     categorical  player's team at the end of 1986
league86   categorical  player's league at the end of 1986
win86      continuous   number of wins in 1986
loss86     continuous   number of losses in 1986
era86      continuous   earned run average in 1986
numgam86   continuous   number of games in 1986
numin86    continuous   number of innings pitched in 1986
numsav86   continuous   number of saves in 1986
numyears   continuous   number of years in the major leagues
totwin     continuous   number of wins during his career
totloss    continuous   number of losses during his career
totera     continuous   earned run average during his career
totgames   continuous   number of games during his career
totin      continuous   number of innings pitched during his career
totsav     continuous   number of saves during his career
sal87      continuous   1987 annual salary on opening day, in thousands of dollars
divisn86   categorical  division
rank86     continuous   position in final league standings in 1986
homattnd   continuous   attendance for home games in 1986
awyattnd   continuous   attendance for away games in 1986

The data set was assembled by the Statistical Graphics Section of the American Statistical Association from three sources: 1) Sports Illustrated, April 20, 1987, 2) The 1987 Baseball Encyclopedia Update (New York: Collier Books), and 3) the Elias Sports Bureau. The data are distributed by StatLib, a service of the Department of Statistics at Carnegie Mellon University. Cells in the worksheet for which data are not available contain an "m" instead of a data value. I've removed six older players (Joe and Phil Niekro, Nolan Ryan, Bert Blyleven, Don Sutton and Steve Carlton) from the data set. These players are the only ones with 200 or more career wins, and including them would complicate our analysis. Hence our analysis applies only to pitchers with fewer than 200 career wins.
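The missing-value code and the 200-win restriction described above can be sketched outside SAS. The Python fragment below is only an illustration and is not part of the course program: the rows are invented placeholders (not values from the data set), the comma-separated layout is an assumption, and the helper names `parse_value`, `load_rows` and `under_200_wins` are hypothetical. Only the column names come from the variable list.

```python
import csv
import io

# Invented placeholder rows (NOT real data); the CSV layout is an assumption.
SAMPLE = """name,totwin,sal87
pitcher_a,45,300
pitcher_b,m,150
pitcher_c,210,900
pitcher_d,12,m
"""

def parse_value(cell):
    """Return a float, or None when the cell holds the missing-value code "m"."""
    cell = cell.strip()
    return None if cell == "m" else float(cell)

def load_rows(text):
    """Read the text as CSV and convert the numeric columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "name": rec["name"],
            "totwin": parse_value(rec["totwin"]),
            "sal87": parse_value(rec["sal87"]),
        })
    return rows

def under_200_wins(rows):
    """Keep pitchers whose career wins are known and below 200."""
    return [r for r in rows if r["totwin"] is not None and r["totwin"] < 200]

rows = load_rows(SAMPLE)
kept = under_200_wins(rows)
print([r["name"] for r in kept])
```

The six pitchers named above have already been removed from the course file, so the filter is shown only to make the restriction explicit. Dropping rows whose career wins are missing is a choice made for this sketch, not something the assignment requires.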

Write-Up

You may work alone, but I suggest you work in a group of two or three class members. The remaining three lab sessions will be given over to the project, so you might want to work with others from your section. You will probably need to continue your analysis after hours, and can do so at any campus Unix cluster. Group projects are expected to reflect in quality the effort of two or three individuals.

Your report should be about 5 or 6 typed pages in length (excluding figures and appendix material) and include:

  1. a title page listing project group members,
  2. a 2 (max 3) page stand-alone, non-technical "executive" summary of findings,
  3. a 1 page (max) summary of your exploratory analysis (see questions 1 and 2 below),
  4. a 1 page (max) section describing steps you took to reach your final regression model (see question 3 below),
  5. a 1 paragraph (max) discussion section in which you outline the limitations and relevance of your analysis, and
  6. an appendix of plots (at most 4) and relevant computer output (at most 1 regression/other analysis).

You will be graded on the quality (not quantity!) of your write-up (clear, concise, spell-checked writing, creativity and care in presentation) as well as on your technical understanding and correct application of statistical methods. The grade break-down is as follows: "executive summary" (8 points), exploratory analysis (4 points), regression analysis (4 points), presentation (2 points), conclusions/criticisms (2 points), for a total of 20 points.

Your project report is due in class on April 28th. Late reports will not be accepted.

Download the SAS/Insight program

Click here and a program with the project data set will appear in your browser window. Click on "File>Save As..." in Netscape and choose "Format for Saved Document: Text" then click "OK". The program is now saved in your account (in your home directory, by default). The file's name is "project.sas". Return to this page by choosing "GO>Back" from the Netscape menu bar.

To print plots and figures created in SAS, you first need to set your printer environment variable so that printed output is directed to the printer of your choice, most likely the printer that serves the cluster in which you are working. To print in the Teer lab, type "setenv PRINTER teerlp1" or "setenv PRINTER teerlp2" before typing "sas project &". If you are working in another cluster, replace teerlp1 with the appropriate printer name; a list of printer names and info on printing files can be found here. More information on printing, especially output containing text, can be found here.

To get started, type "sas project &" in one of the terminals open on your screen.

"Questions"/Directions for Analysis

1) Familiarize yourself with the data. Identify aspects of the data that will be important in your subsequent analysis. You might, for example, answer questions such as "how many players does the data set comprise?", "what is typical in terms of player experience and performance?" and "what is a typical salary?"
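As a rough illustration of the kind of summary Question 1 asks for, the Python sketch below counts players and computes a few measures of a "typical" salary. It is not the SAS/Insight workflow used in lab; the salaries are invented placeholders (not real data), and `None` is used here to stand for the "m" missing-value code.

```python
import statistics

# Invented placeholder salaries in thousands of dollars (NOT real data).
# None stands for a cell that holds the missing-value code "m".
sal87 = [100, 150, 200, 250, 300, 900, None]

# Summaries are computed only over players whose salary is recorded.
observed = [s for s in sal87 if s is not None]

summary = {
    "n_players": len(sal87),
    "n_with_salary": len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "quartiles": statistics.quantiles(observed, n=4),  # three cut points
}

for key, value in summary.items():
    print(key, value)
```

With a right-skewed variable such as salary, the mean is pulled above the median by a few large values, so the median is often the more useful description of a typical salary.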

2) Explore relationships between sal87 and other variables in the data set (except for name and team86, players' names and teams). Look at scatter plots of salary by continuous covariates, and box plots or histograms of salary grouped by categorical covariates. What do you see? Do any of the covariates show promise as predictors of salary?
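The numerical counterparts of the plots in Question 2 can be sketched as follows: a correlation for a continuous covariate and group medians for a categorical one. This is an illustrative Python fragment rather than SAS/Insight output. The records are invented placeholders (not real data), the league codes "A" and "N" are assumed for the example, and the function name `pearson` is hypothetical.

```python
import math
import statistics
from collections import defaultdict

# Invented placeholder records (NOT real data):
# (league86, numyears, sal87 in thousands of dollars).
# The codes "A" and "N" are assumptions about how league86 is coded.
records = [
    ("A", 1, 70), ("A", 3, 150), ("A", 6, 400), ("A", 10, 800),
    ("N", 2, 90), ("N", 4, 220), ("N", 8, 650), ("N", 12, 900),
]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

years = [r[1] for r in records]
salary = [r[2] for r in records]
r_years_salary = pearson(years, salary)

# Group salaries by the categorical covariate and compare medians.
by_league = defaultdict(list)
for league, _, sal in records:
    by_league[league].append(sal)
median_by_league = {k: statistics.median(v) for k, v in by_league.items()}

print("correlation(numyears, sal87):", round(r_years_salary, 3))
print("median salary by league:", median_by_league)
```

A correlation summarizes only the linear part of a relationship, so it complements the scatter plots rather than replacing them.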

3) Use linear regression to explore the relationship between salary and factors that may influence salary. While factors that determine salary are unknown to us, we can use the variables we have to correlate experience (numyears, numgam86, totgames, numin86, totin), previous year's performance (win86, loss86, era86, numsav86), career performance (totwin, totloss, totera, totsav) and team characteristics (league86, divisn86, rank86, homattnd, awyattnd) with salary.

Verify that the histogram of sal87 is highly right-skewed. Start by fitting a linear regression model for sal87 that includes (a) the promising covariates identified in "Question 2" and (b) those that you have strong prior beliefs should be included in the model as predictor variables. Plot residuals against fitted values for this regression; verify that this plot looks like the right panel of Figure 14-10 on page 463 of the book.

Create a new response variable logsal87 = log(sal87) and repeat the previous regression using logsal87 as the response variable. Look at plots of residuals against all predictor variables to verify that each residual relationship is flat. Some measures of experience, like years playing, may show a different relationship for early-, mid- and late-career players; look for patterns in the residual plots like those pictured in Figure 14-7(b) on page 462 of the book. Enter appropriate quadratic terms into the model, if necessary.

Search for a parsimonious model by removing, one at a time, variables that (a) have little explanatory ability and (b) you had only weak prior reason to include in the model.
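The log transformation and residual check described above can be sketched with a single predictor. The Python fragment below is a simplified illustration, not the multiple regression you will fit in SAS/Insight: the values are invented placeholders (not real data), log is taken to be the natural logarithm, and the helper names `fit_line` and `residuals` are hypothetical.

```python
import math
import statistics

# Invented placeholder values (NOT real data).
numyears = [1, 2, 3, 5, 8, 12]
sal87 = [70, 90, 150, 300, 650, 900]   # thousands of dollars

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

def residuals(x, y, b0, b1):
    """Observed response minus fitted value, one per observation."""
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

# New response variable: natural log of salary (assumed meaning of "log").
logsal87 = [math.log(s) for s in sal87]

b0, b1 = fit_line(numyears, logsal87)
res = residuals(numyears, logsal87, b0, b1)

# Listing residuals in order of the predictor is a crude stand-in for a
# residual plot: a run of one sign followed by a run of the other hints at curvature.
for x, r in zip(numyears, res):
    print(f"numyears={x:2d}  residual={r:+.3f}")
```

If the residuals show a systematic pattern across the predictor, that is the situation in which a quadratic term in that predictor is worth trying.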

In summarizing your results, remember that your response variable is the log of salary.
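One way to make this reminder concrete, assuming log denotes the natural logarithm and writing b_j for the fitted coefficient of predictor x_j (notation introduced here only for illustration):

```latex
\[
\log\bigl(\widehat{\mathrm{sal87}}\bigr) = b_0 + b_1 x_1 + \cdots + b_p x_p
\quad\Longrightarrow\quad
\widehat{\mathrm{sal87}} = e^{b_0}\, e^{b_1 x_1} \cdots e^{b_p x_p}
\]
```

Under this assumption, a one-unit increase in x_j, with the other predictors held fixed, multiplies the back-transformed salary prediction by e^{b_j}. Effects are therefore multiplicative (percentage changes) on the salary scale rather than additive dollar amounts.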



iversen@stat.duke.edu
last updated April 6 1998