STA 242/ENV 255

Lab 7

3/29/2000

DUE 4/5/2000

This lab will explore different methods of variable selection and Bayesian Model Averaging in the context of the sex discrimination case study. Prior to lab, please read and be prepared to discuss the following:

Reading Assignment for Lab:

Case Study 12.2 (pages 329-332), Sections 12.5 and 12.7, and Chapter 12, Exercise 16
And of course, all of the conceptual exercises! In particular, exercises 4, 7, and 8 are relevant for the assignment this week.

Preliminaries:

Read in the dataset CASE1202
Create the variable r as described in Exercise 16; this represents the average annual raise, which will be used as the response variable below.
Download the Splus function file bicreg.ssc and load it into Splus. At the command line, enter: source("bicreg.ssc")

Problems:

Answer the questions below, then write a brief report (one page or less) summarizing your findings from parts 1-4 on whether there is evidence that annual raises depend on gender. Your summary should avoid statistical jargon and should be written in language that non-statisticians could understand (for example, Harris Trust and Savings or the plaintiffs in the suit).

1. Using r as the response (or a transformation of it), use a two-sample t-test to see whether the distribution of raises differs between males and females. Is a transformation necessary? Draw side-by-side boxplots to check for equal variance: plot.factor(r ~ as.factor(fsex), data=Case1202), and histograms of r (use fsex==1 for the SUBSET to create a histogram of females only; repeat with fsex==0 for males). Is there any skewness? This may indicate the need for a transformation. Look at the residuals from fitting the model r ~ fsex. Create any transformed variables. The two-sample t-test can be conducted from the Statistics menu, using Compare Samples, Two Samples, t-test. Click the box for "Grouping variable" and specify fsex as the grouping variable. Is there a significant difference in raises between males and females? Explain. Construct and interpret a confidence interval for the sex effect - be sure to report the effect in the original units if you transformed the data. Summarize your conclusions in a sentence or two.
2. Fit a regression model to answer the question "What evidence is there of a sex effect after the effect of age on average raise has been accounted for?" Again, do you need to transform the data? Be sure to write down the null hypothesis and alternative hypothesis, specify the test statistic and any important information regarding its sampling distribution, and give the p-value. (Don't forget to answer the question :-). Also check and comment on the assumptions for the model and testing. Construct and interpret a confidence interval for the sex effect - be sure to report the effect in the original units if you transformed the data. Summarize your conclusions in a sentence or two.
3. Fit a regression model to answer the question "What evidence is there of a sex effect after the effects of age and beginning salary have been accounted for?" Be sure to write down the null hypothesis and alternative hypothesis, specify the test statistic and any important information regarding its sampling distribution, and give the p-value. Also check and comment on the assumptions for the model and testing. Construct and interpret a confidence interval for the sex effect - be sure to report the effect in the original units if you transformed the data. Summarize your conclusions in a sentence or two.
4. Use bicreg to calculate BIC and posterior model probabilities for all possible models with the three potential variables: age, beginning salary, and sex (columns 1, 3, and 5 of the dataframe, or Case1202[, c(1,3,5)]). If you transformed r previously, use the transformed variable in place of r. I am assuming r is in column 8 of the dataframe, i.e. Case1202[, 8] below.

sex.bic <- bicreg(Case1202[ , c(1,3,5)], Case1202[, 8])

Enter

sex.bic

to print out all the results stored in sex.bic. The important components are:

probne0 - the posterior probability that the coefficient for each variable is non-zero
which - a table (or matrix) where each row is a model and each column is a variable; "T" means the variable is included, "F" means it is excluded
bic - the BIC for each model (smaller is better); models are ranked by BIC
postprob - the posterior probability of each model (these should sum to 1)
postmean - the overall (model-averaged) posterior mean of each coefficient given the data
ols - the ordinary least squares estimates for each model

See the function bicreg for more documentation.
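
The way bic, postprob, and probne0 relate to one another can be sketched as follows. This is a minimal illustration in S (it should also run in R), assuming equal prior probabilities for all models and the usual BIC approximation to the posterior. The BIC values and the inclusion matrix are made up for illustration and are not from CASE1202, and the object names bic, included, and w are hypothetical. Check the bicreg source for the exact calculation; bicreg may, for instance, report probne0 on a percentage scale.

```r
# Hypothetical BIC values for three models (illustration only)
bic <- c(-20.0, -18.0, -15.0)

# Hypothetical inclusion matrix: rows are models, columns are variables
included <- matrix(c(TRUE,  TRUE,  FALSE,
                     TRUE,  FALSE, FALSE,
                     FALSE, FALSE, TRUE),
                   nrow = 3, byrow = TRUE)

# Under equal prior model probabilities, posterior probabilities are
# approximately proportional to exp(-BIC/2); subtracting min(bic) first
# avoids numerical overflow and does not change the ratios
w <- exp(-0.5 * (bic - min(bic)))
postprob <- w / sum(w)

# Probability that each coefficient is non-zero: add up the posterior
# probabilities of the models that include that variable
probne0 <- colSums(included * postprob)

postprob
probne0
```

The model with the smallest BIC always receives the largest posterior probability, which is why ranking by BIC and ranking by postprob give the same order.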

To create a plot of the model space, use

plot.models(sex.bic)

The colors and scaling on the Y-axis correspond to log(postprob) - log(min(postprob)) + 1.0. Without the 1.0, this is the log(Bayes Factor) for comparing each model to the worst model. The 1.0 is added to all values to differentiate the model with no variables and the worst model. The best model appears in the top row; the worst model is at the bottom. The default color is blue where variables are excluded. The color is incremented based on log(postprob) - log(min(postprob)) + 1.0.
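
The scale described above can be reproduced directly. The following is a small sketch in S (it should also run in R) using made-up posterior probabilities; the names postprob, logbf, and height are hypothetical and are not components returned by plot.models.

```r
# Hypothetical posterior model probabilities (illustration only)
postprob <- c(0.60, 0.30, 0.08, 0.02)

# Log Bayes factor of each model against the worst model
# (equal prior model probabilities assumed)
logbf <- log(postprob) - log(min(postprob))

# Quantity described for the Y-axis and colors: shifted up by 1.0,
# so the worst model gets the value 1.0 rather than 0
height <- logbf + 1.0

height
```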

List the 8 models, their BIC values, and their posterior probabilities. (If you are unclear on how these are calculated, fit all 8 models and verify the results by hand.)
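
One way to check a single BIC value by hand is sketched below in S (it should also run in R). It assumes that bicreg measures BIC relative to the intercept-only model, as n*log(1 - R^2) + p*log(n), where p is the number of predictors; please confirm this against the bicreg source before relying on it. The data are simulated for illustration only, and the names x1, x2, and y are hypothetical; in the lab you would use the Case1202 columns instead.

```r
# Simulated data for illustration only (not CASE1202)
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)

# Fit one candidate model and extract its R-squared
fit <- lm(y ~ x1 + x2)
r2 <- summary(fit)$r.squared

# Number of predictors in this model, not counting the intercept
p <- 2

# BIC relative to the intercept-only model (assumed form, see above)
bic <- n * log(1 - r2) + p * log(n)
bic
```

Under this form the model with no variables has BIC equal to 0, so a negative BIC means the model is preferred to the model with no variables.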

Which model is the best BIC model? Is this the same model you would get using stepwise selection? What is the posterior probability of the best BIC model? What is the posterior probability that sex has an effect on annual raises? How does this result compare to your findings in parts 1-3? Which variables are important in accounting for raises? Explain. Write a brief summary of your Bayesian analysis and conclusions.