STA110B: Project Information

Due by 5:00PM on Wednesday, 26APR2000 in Jenise's mailbox (in 211 Old Chem)

Data and description of the problem

Our analysis concerns the possible existence of gender discrimination in employment at a bank (which was indeed sued for sex discrimination). We have data for 32 male and 61 female employees who are all skilled, entry-level clerical workers, hired in the years spanning 1965 through 1975. The data was released by the defense in "Harris Trust and Savings Bank: An Analysis of Employee Compensation", Report 7946, Center for Mathematical Studies in Business and Economics, University of Chicago Graduate School of Business, 1979. We have the following variables in the data set (in this order):

sal_hire: Annual salary at the time of hiring (in U.S. dollars)
sal_1977: Salary as of March 1977 (in U.S. dollars)
sex: Females denoted by "1", males denoted by "0"
seniorty: Months since hired at the bank
age: Age (in months)
educatn: Educational level (in years)
expernce: Work experience before coming to work at the bank (in months)

There are two questions that we seek to answer using this data. Did females tend to receive lower starting salaries than males, even when they were similarly qualified? In cases where the women and men had similar backgrounds/qualifications/seniority, did females tend to receive lower pay increases than their male counterparts?

Obtaining the data

You can obtain this data in one of two ways:

1. You can read the data from the data file in my home UNIX directory (as we usually do in lab). The code you will need to get the data into SAS can be found below. This code produces a data set named "project" in your sta110b directory; if you want to change the name, change the second line to include your preferred name (which can be a maximum of 8 letters) in place of "project".

libname sta110b '~/sta110b';
data sta110b.project;
  infile '~jls11/public/discrimination.dat';
  input sal_hire sal_1977 sex $ seniorty age educatn expernce;
run;

2. You can save the data as a text file on your computer. This is particularly useful if you're planning to work with SAS on the PCs or if you're using another software package. Just click here to see the data. Then save the file on your computer, under a filename of your choosing, using your browser. (With Netscape, use the Save As option under File.) If you're going to use SAS on the PCs, you can use the following code to read the data from the file "your_filename" into a data set named "project" in the default SAS library "sasuser".

data sasuser.project;
  infile 'C:\your_filename';
  input sal_hire sal_1977 sex $ seniorty age educatn expernce;
run;
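
Whichever method you use, it is worth checking that the data were read correctly before starting the analysis. The following is a minimal sketch, assuming the data set is sasuser.project (substitute sta110b.project, after the libname statement, if you used the first method). The choice of procedures here is only a suggestion.

```sas
/* Assumes the data set was created as sasuser.project (method 2). */

/* List the variables and the number of observations (32 + 61 = 93 expected). */
proc contents data=sasuser.project;
run;

/* Counts by sex: the description above implies 61 with sex=1 and 32 with sex=0. */
proc freq data=sasuser.project;
  tables sex;
run;

/* Summary statistics for the numeric variables, to spot misread values. */
proc means data=sasuser.project n mean std min max;
  var sal_hire sal_1977 seniorty age educatn expernce;
run;
```

If the counts or ranges look wrong, check that the variable order on the input statement matches the order listed in the data description.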

Format/requirements for your write-up

You may work in groups of two or three, with each group handing in one write-up. (Group members do not each need to hand in their own write-up.) If you like, you may work by yourself. It might be easiest to form a group with others from your section, since at least one lab will have some time devoted to project work. A group project should reflect the quality of the combined work of two or three individuals.

You will be graded on the quality of your work, not the quantity of words/pages in your write-up, so be clear and concise. In terms of length, aim for a report of about 5 typewritten pages (not including figures and/or tables), according to these guidelines:

Title page listing group members (doesn't count toward the 5 pages)
2 page (max) non-technical "executive" summary of findings
1 page (max) summary of your exploratory analysis
1 page (max) summary of steps you took to reach your final regression model
1-2 paragraphs (max) talking about the relevance as well as the limitations of your analysis
appendix of plots (no more than 4) and relevant computer output (at most 1 regression/other analysis)

The project will be graded out of 20 points, with a breakdown as follows:

"executive" summary: 8 points
exploratory analysis: 4 points
regression analysis: 4 points
presentation: 2 points
conclusions/criticisms: 2 points

Printing from SAS/INSIGHT

Note that printing tables and graphics from SAS/INSIGHT is more complicated than one might expect. Generally, choosing File -> Print from an INSIGHT output window doesn't work well. The easiest way to deal with printing is to save each set of graphics to a file, and then print the file.

Graphs and tables tend to take up a lot of room on standard letter-size paper. To keep graphs/tables from getting cut off at the margins (which happens when graphs are too big for the page), it's best to reduce the font size. (The default size on the UNIX cluster machines is generally 14.) Choose Edit -> Windows -> Fonts. In the window that appears, you can change the type of font and the size. The font sizes of 8 and 10 generally produce better results. With the PCs the default font tends to be a SAS font (the font's name will contain the word "SAS" if this is the case); you'll probably want to pick one of the more common (and neater-looking) fonts.
Delete from the output window any graphs/tables that you don't need. This makes the file smaller and faster to print, and it keeps you from getting confused (or confusing others) with unnecessary information. To delete a graph or table, first click on its border to select it. Then, choose Edit -> Delete.
Select the items in the window that you want to print. You can do this by clicking on the border of the table or graph. If you want to print all of the tables or graphs in a window, go to Edit -> Windows -> Select all to select all the tables or graphs without having to click on each individually.
To save the selected item(s) as a file, use File -> Save -> Graphics File. In the window that appears, type a name for the file; let's use "graphs.ps" as an example. Choose PS (save as a PostScript file) and grey scale. If you want each of the tables/graphs to be saved as a separate file, choose the "One Per File" option. (If you're on a PC, don't try to save the selected items as a PostScript file. Instead, use Edit -> Copy to copy the selected graphs and tables to the clipboard. Then, you can easily incorporate the clipboard contents as a bitmap into a Microsoft Word document and skip the two steps below.)
Go to the xterm window from which you earlier started SAS with the command "sas &". To make sure the file is there, type "ls". You should see a listing of all the files in the directory, including the one you just created. To view the PostScript file, use the command "ghostview graphs.ps &". This brings up a window displaying the file you made ("graphs.ps"), so you can make sure that the graphics/tables you wanted are actually there (and verify that none of them got cut off).
To print the file, type "lpr -Pprinter graphs.ps" in the xterm window, where "printer" is replaced by the name of the printer in your cluster. The printers in the public UNIX clusters have labels on them and are named for the cluster in which they reside. For instance, in the Teer cluster (the one in which STA110B sections meet), the 2 printers are named "teerlp1" and "teerlp2".

Questions and "directions" for consideration

This section may be updated as needed.

1. Become familiar with the data. Consider scatterplots of each of the explanatory variables against salary (both salary at the time of hire and salary in 1977). Are any of the explanatory variables related to one another? You might try using the Scatterplot (under Analyze) to draw a matrix of plots. To do this, determine your variables of interest, then enter this list under both X and Y. Scatterplots of all possible pairs will result (arranged in a matrix).
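
If you prefer to work from the program editor rather than SAS/INSIGHT, a rough equivalent of this exploratory step might look like the sketch below. It assumes the data set is sasuser.project; the particular pair of variables plotted is only an illustration. Note that sex is left out of the var statement because it was read as a character variable.

```sas
/* Pairwise correlations among the numeric variables. */
proc corr data=sasuser.project;
  var sal_hire sal_1977 seniorty age educatn expernce;
run;

/* A simple (line-printer) scatterplot of hiring salary against education,
   using the value of sex (0 or 1) as the plotting symbol. */
proc plot data=sasuser.project;
  plot sal_hire*educatn=sex;
run;
quit;
```

Correlations only summarize linear association, so they complement the scatterplot matrix rather than replace it.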

2. Based on this exploratory analysis, consider performing a log transformation on both salary variables.
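
One way to create the transformed variables is a short data step, sketched below. The data set name projlog and the variable names lsalhire and lsal1977 are arbitrary choices made for this illustration (kept to 8 characters), and the input data set is assumed to be sasuser.project.

```sas
/* Create natural-log versions of both salary variables in a temporary data set. */
data projlog;
  set sasuser.project;
  lsalhire = log(sal_hire);   /* log() is the natural logarithm in SAS */
  lsal1977 = log(sal_1977);
run;

/* Compare the distributions before and after the transformation. */
proc univariate data=projlog plot;
  var sal_hire lsalhire sal_1977 lsal1977;
run;
```

The plot option adds stem-and-leaf, box, and normal probability plots, which can help you judge whether the transformation makes the salary distributions more symmetric.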

3. First, take the case of (log) salary at the time of hire. Leave out "sex" as an explanatory variable for now, in order to consider which other variables help determine salary. Try to fit a multiple regression model in which the explanatory variables all make strictly linear contributions.
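
In symbols, a strictly linear model of this kind has the form log(sal_hire) = b0 + b1*seniorty + b2*age + b3*educatn + b4*expernce + error. A minimal sketch of fitting it with PROC REG follows, assuming the data set is sasuser.project; including all four variables is only a starting point, and deciding which to keep is part of your analysis. The names projlin and lsalhire are made up for this example.

```sas
/* Add the log of hiring salary to a temporary copy of the data. */
data projlin;
  set sasuser.project;
  lsalhire = log(sal_hire);
run;

/* Multiple regression with strictly linear terms and without sex. */
proc reg data=projlin;
  model lsalhire = seniorty age educatn expernce;
run;
quit;
```

The output includes R-square, adjusted R-square, and a t statistic for each coefficient.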

4. Consider adding interaction effects into this model. Which ones seem to significantly improve the model? Use R-sq/adjusted R-sq and the t statistics (as well as your own logic and reasoning) to help you decide which one(s) to include. Remember that it is considered good form to include the linear terms that correspond to interaction terms under consideration.
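
PROC REG does not build interaction terms for you, so one approach is to compute each product in a data step and add it to the model. The sketch below assumes the data set is sasuser.project and uses an age-by-education interaction purely as an illustration; it is not a suggestion that this particular term belongs in your final model. The names projint and age_educ are made up for this example.

```sas
/* Create the log response and one candidate interaction term. */
data projint;
  set sasuser.project;
  lsalhire = log(sal_hire);
  age_educ = age * educatn;   /* hypothetical choice of interaction */
run;

/* Keep the linear terms (age, educatn) alongside their interaction. */
proc reg data=projint;
  model lsalhire = seniorty age educatn expernce age_educ;
run;
quit;
```

Compare the adjusted R-square and the t statistic for age_educ with the fit that has no interaction term.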

5. Once you have determined which other variables/interactions/etc. influence salary, add "sex" into the model as an explanatory variable. Perform any appropriate test(s) to see whether sex is a factor in determining salary. If you conclude that sex is a factor, interpret how big a role it plays in the determination. If you conclude that sex is not a factor, explain your decision in terms of the model, any appropriate tests, etc.
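
Because the input statements above read sex as a character variable, it has to be converted to a numeric indicator before PROC REG can use it. A minimal sketch follows, assuming the data set is sasuser.project; the model shown contains only the linear terms, so substitute whatever variables and interactions you settled on. The names projsex and female are made up for this example.

```sas
/* Create the log response and a 0/1 numeric indicator for sex. */
data projsex;
  set sasuser.project;
  lsalhire = log(sal_hire);
  female = (sex = '1');   /* 1 for females, 0 for males */
run;

/* The t statistic for female tests whether its coefficient is zero,
   given the other variables in the model. */
proc reg data=projsex;
  model lsalhire = seniorty age educatn expernce female;
run;
quit;
```

Since the response is on the log scale, a fitted coefficient b for female corresponds to a multiplicative effect: exp(b) estimates the ratio of female to male salary with the other variables held fixed, so 100*(exp(b) - 1) is the estimated percentage difference.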

6. To answer the second question (concerning pay increases), follow a strategy similar to the one above, except that here you will want to somehow take into account both the hiring salary and the amount of time since the person began working at the bank.
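
One possible way (among several) to take hiring salary and time at the bank into account is to model log 1977 salary with log hiring salary and seniority as explanatory variables, as sketched below. This assumes the data set is sasuser.project, and the names projinc, lsalhire, lsal1977, and female are made up for this example. Another option would be to use log(sal_1977/sal_hire) as the response; which formulation you choose, and which other variables you keep, is up to you.

```sas
/* Create both log salaries and a 0/1 numeric indicator for sex. */
data projinc;
  set sasuser.project;
  lsalhire = log(sal_hire);
  lsal1977 = log(sal_1977);
  female = (sex = '1');   /* 1 for females, 0 for males */
run;

/* 1977 salary adjusted for hiring salary, seniority, and background. */
proc reg data=projinc;
  model lsal1977 = lsalhire seniorty age educatn expernce female;
run;
quit;
```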