Statistics 110B Project -- Fall 1999

Data and Problem

We will analyze wage data from the 1985 Current Population Survey, a supplemental survey to the US census. The purpose of our analysis will be to explore the relationship between hourly wages (dollars per hour) and factors that may explain hourly wages. We seek to answer, at least in part, the following questions: 'what is a good statistical model for wages?', 'is there evidence in the data of a gender gap in wages?', and 'what are the limitations of our analysis?' A brief description of the variables that constitute the data set follows.
Variable   Type         Definition
EDUCATN    ordinal      number of years of education
SOUTH      nominal      indicator variable for Southern region
GENDER     nominal      indicator variable for gender
EXPERNCE   ordinal      number of years of work experience
UNION      nominal      indicator variable for union membership
WAGES      continuous   wage in dollars per hour
AGE        ordinal      age in years
RACE       nominal      race category
OCCUPATN   nominal      occupational category
SECTOR     nominal      sector of economy
MARRIED    nominal      indicator variable for married

Write-Up

You may work alone, but I suggest you work in a group of two or three (max) class members. The remaining lab sessions will be given over to the project, so you might want to work with others from your section. You will probably need to continue your analysis after hours, and can do so at any campus PC or Unix cluster. Group projects are expected to reflect in quality the effort of two or three individuals.

Your report should be about 5 typed pages in length (excluding figures and appendix material) and include:

  1. a title page listing project group members,
  2. a 2 page (max) stand-alone, non-technical "executive" summary of findings,
  3. a 1 page (max) summary of your exploratory analysis (see questions 1 and 2 below),
  4. a 1 page (max) section describing steps you took to reach your final regression model (see question 3 below),
  5. a 1 paragraph (max) discussion section in which you outline the limitations and relevance of your analysis,
  6. and an appendix of plots (at most 4) and relevant computer output (at most 1 regression/other analysis).

You will be graded on the quality (not quantity!) of your write-up (clear, concise, spell-checked writing; creativity and care in presentation) as well as on your technical understanding and correct application of statistical methods. The grade breakdown is as follows: "executive summary" (8 points), exploratory analysis (4 points), regression analysis (4 points), presentation (2 points), conclusions/criticisms (2 points), for a total of 20 points.

Presume your reader is comfortable with basic statistical methods but is not an expert in them, and assume that your reader is not familiar with the data. When you use a statistical method like regression or ANOVA, explain why and carefully interpret your results. Present only important summaries, plots and their interpretations; don't burden your reader with unnecessary facts and analysis.

Your project report is due in class on December 9th. Late reports will not be accepted.

Download the SAS/Insight program

Click here and a program with the project data set will appear in your browser window. Click on "File>Save As..." in Netscape and choose "Save in: D: " then click "Save". The program should now be saved on the D drive of your computer. The file's name is "project.sas". Return to this page by choosing "GO>Back" from the Netscape menu bar.

To get started, open SAS by clicking "Start>Programs>Statistics and Mathematics>SAS System v6.11". Once the SAS environment appears, click on "File>Open". In the "Open" window, type "D:\project.sas" in the "File name:" field and then click "Open". Next, find the button on the menu bar with a picture of a running person. Click on this button and the data set will appear. Some information on printing, especially output containing text, can be found here.

"Questions"/Directions for Analysis

1) Familiarize yourself with the data. Identify aspects of the data that will be important in your subsequent analysis. You might, for example, answer the questions "what differences, if any, are there between male and female workers, or between union and non-union workers?" and "how do employees differ by occupation and sector?", etc. One question you should answer is "are there any redundant variables?", specifically, "what is the relationship between years of experience and the other variables?".
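
If you would like to check for a redundant variable outside of SAS, here is a minimal sketch in Python. The records are made up for illustration (they are not the project data), and the function names constant_gap and pearson are hypothetical. The idea is simply to ask whether one variable is an exact combination of others, and to back that up with a correlation.

```python
import math

# Hypothetical toy records (age, education, experience); NOT the project data.
rows = [
    (35, 12, 17),
    (28, 16, 6),
    (50, 10, 34),
    (41, 14, 21),
    (23, 12, 5),
]

def constant_gap(rows):
    """Return the common value of age - education - experience,
    or None if that quantity varies from record to record."""
    gaps = {age - educ - exper for age, educ, exper in rows}
    return gaps.pop() if len(gaps) == 1 else None

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = [r[0] for r in rows]
expers = [r[2] for r in rows]

print("constant gap:", constant_gap(rows))
print("corr(age, experience):", pearson(ages, expers))
```

If the gap is constant, then one of the three variables carries no information beyond the other two, and including all three in a regression would be redundant. Whether this holds in the project data is for you to determine.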

2) Explore relationships between wages and other variables in the data set. Look at scatter plots of wages by continuous covariates, and box plots or histograms of wages grouped by categorical covariates. What do you see? Do any of the covariates show promise as predictors of hourly wages?
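
The numbers behind a grouped box plot can also be tabulated directly. The following is a minimal sketch in Python, assuming made-up records of (wage, union indicator); neither the data nor the function name summarize_by_group comes from the project files.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy records: (wage in dollars per hour, union indicator).
# These values are invented for illustration; they are NOT the project data.
records = [
    (5.10, 0), (8.00, 0), (6.25, 0), (4.50, 0),
    (12.50, 1), (9.75, 1), (11.00, 1),
]

def summarize_by_group(records):
    """Group wages by the categorical covariate and return the count,
    median, and mean of wages within each group."""
    groups = defaultdict(list)
    for wage, group in records:
        groups[group].append(wage)
    return {
        g: {"n": len(w), "median": median(w), "mean": mean(w)}
        for g, w in groups.items()
    }

summary = summarize_by_group(records)
for g in sorted(summary):
    print(g, summary[g])
```

A clear separation between group medians suggests that the covariate may be a useful predictor, but a plot is still needed to judge spread and outliers.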

3) Use linear regression to explore the relationship between hourly wages and factors that may influence wages. While data for many of the variables that determine wages are not available to us in this data set, we can use the variables we have to correlate type of job (occupation and sector), qualifications (education and experience), personal characteristics (age, gender, married, and race), union membership (union) and region of country (south) with hourly wages.

Start by fitting a linear regression model for wages that includes (a) promising covariates identified in "Question 2" and (b) those that you have strong prior beliefs should be included in the model as predictor variables; do not include redundant variables. Plot residuals against fitted values for this regression; verify that this plot looks like the right panel of Figure 14-10 on page 463 of the book. Create a new response variable logwages = log(wages). Repeat the previous regression using logwages as the response variable. Look at plots of residuals against all predictor variables to verify that each residual relationship is flat. For example, is the relationship between age and wages linear? Look for patterns in the residual plots of continuous and ordinal variables like that pictured in Figure 14-7(b) on page 462 of the book. Enter appropriate quadratic terms into the model, if necessary. Work with your model until its residual plots look OK.
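
The sequence of steps above (fit on wages, refit on logwages, add a quadratic term, inspect residuals) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAS/Insight procedure: the workers, the coefficients in true_beta, and the helper names ols, fitted, and residuals are all hypothetical, and the toy wages are generated without noise so that the effect of the log transform is easy to see.

```python
import math

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(p):
            if k != col:
                f = A[k][col] / A[col][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def fitted(X, beta):
    """Fitted values X b."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def residuals(X, y, beta):
    """Observed minus fitted values."""
    return [yi - fi for yi, fi in zip(y, fitted(X, beta))]

# Hypothetical toy workers: (education, experience, female). NOT the project data.
workers = [(12, 5, 0), (12, 20, 1), (16, 10, 0), (16, 30, 1),
           (10, 40, 0), (14, 15, 1), (18, 2, 0), (11, 25, 1)]

# Made-up coefficients, used only to generate noise-free toy wages.
true_beta = [0.5, 0.08, 0.04, -0.0008, -0.2]

# Design matrix: intercept, education, experience, experience squared, female.
X = [[1.0, ed, ex, ex ** 2, f] for ed, ex, f in workers]
wages = [math.exp(v) for v in fitted(X, true_beta)]
logwages = [math.log(w) for w in wages]

beta_raw = ols(X, wages)      # regression on wages
beta_log = ols(X, logwages)   # regression on log(wages)

# Pairs (fitted value, residual) are what you would plot for each model.
raw_pairs = list(zip(fitted(X, beta_raw), residuals(X, wages, beta_raw)))
log_pairs = list(zip(fitted(X, beta_log), residuals(X, logwages, beta_log)))

max_raw = max(abs(r) for _, r in raw_pairs)
max_log = max(abs(r) for _, r in log_pairs)
print("largest residual, wages model:    ", max_raw)
print("largest residual, logwages model: ", max_log)
```

Because the toy wages are exactly log-linear, the logwages model leaves essentially no residual while the wages model does. Real data will be noisier, so judge the residual plots rather than a single number.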

Summarize and interpret your regression results. Do you find evidence of a gender gap in wages? On average, controlling for other factors, how much higher or lower do you predict wages to be for a 40 year old worker than for a 50 year old worker, for a worker in the South vs. a non-Southern worker, for a female worker vs. a male worker, etc.? Remember that your response variable is the log of wages.
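
Because the response is the log of wages, a difference in predicted logwages corresponds to a ratio of predicted wages. The sketch below shows the arithmetic in Python, assuming natural logs. The coefficient values and the names b_female, b_age, b_age2, and percent_difference are invented for illustration; substitute the estimates from your own model.

```python
import math

def percent_difference(delta_log):
    """Convert a difference in predicted log(wages) into a percent
    difference in predicted wages: 100 * (exp(delta) - 1)."""
    return 100.0 * (math.exp(delta_log) - 1.0)

# Hypothetical coefficient estimates from a log-wage regression.
b_female = -0.20   # indicator: female = 1, male = 0
b_age = 0.04       # linear age term
b_age2 = -0.0004   # quadratic age term

# Female vs. male, other covariates held fixed.
gender_gap = percent_difference(b_female)

# Age 40 vs. age 50: both the linear and the quadratic terms change.
delta_age = b_age * (40 - 50) + b_age2 * (40 ** 2 - 50 ** 2)
age_gap = percent_difference(delta_age)

print("female vs. male: %.2f%%" % gender_gap)
print("age 40 vs. 50:   %.2f%%" % age_gap)
```

For small coefficients the percent difference is close to 100 times the coefficient itself, but for larger ones the exponential should be used.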



iversen@stat.duke.edu
last updated November 12 1999