Lab: Confidence intervals

Purpose and summary of procedures

The purposes of today's lab are

learn how to use S-Plus to find confidence intervals
use simulation to facilitate understanding/intuition about confidence intervals

Starting S-Plus and getting the data

To start S-Plus, click Start, then Programs, then Statistics & Mathematics, then S-PLUS 2000. Click here to get the data for this lab exercise. Use the Netscape menus (File, then Save As to save the data to the file "income.txt" on your desktop). This data set contains one column (the same list of incomes used in our last lab).

To read the data into S-Plus, choose from the S-Plus menus File, then Import Data, then From file. Choose the file that you have just saved. You should have a spreadsheet called "incomes" open with the data in one columns, labeled "V1". Note: By default, S-Plus will name the dataset according to the name of the file from which the data was imported. Your dataset will not be named "incomes" unless you named your data file "incomes" (plus file extension).

We're interested in assessing the mean income of the workers in this company. The data about the whole population is given in our data set. However, let's imagine that we don't have access to this whole file. Instead, we are forced to take a random sample of 40 employees, and make an estimate of the mean income using the incomes of the people in our sample.

Let's begin by taking this random sample of 40 people from our data set. Go to Data, Random sample. Choose "income" as your data set, and make sure to set the sample size as 40. This random sample should appear as a separate data sheet.

To find a 95% confidence interval for the population mean (based on the sample you just took), go to Statistics, Compare samples, One sample, t test. Note that we will be using this same menu for hypothesis tests (subject of the next few lectures), since these tests and the confidence interval methodology are closely related. Make sure to enter the name of the data set which contains the random sample and the correct column. You also need to choose 0 for the "mean under the null hypothesis", "two-sided" as the alternative hypothesis, and confidence level "0.95" In the report that is generated, you will find the confidence interval on the fourth and fifth lines.

Does your confidence interval seem reasonable?

Go to Statistics, Data summaries, Summary statistics and find the real population mean of all 10000 raw incomes. What percentage of confidence intervals generated in this fashion are likely to enclose the actual mean?

Try this procedure again with other random samples, and see what happens.

Does your confidence interval change each time? How does changing the sample size affect the length of the confidence interval?

Simulation: What percentage of samples will yield confidence intervals that actually bracket the parameter of interest?

Now, we're going to take 100 random samples of size 40 from the population of incomes. For each sample, we will calculate the confidence interval. At the end, we will generate a plot that shows how many samples yielded confidence intervals that actually bracketed the mean. To do this, insert the code below into the S-Plus "commands" window. Note that the first 3 lines determine how many samples you take, how large each of those samples are, and what your level of significance is. You can change these values if you'd like.

numsamps _ 100  #How many samples will we take?
sampsize _ 40   #How large will each sample be?
alpha _ 0.05    #What is our level of significance?
	
lower.bd _ rep(0,numsamps)
upper.bd _ rep(0,numsamps)

t.stat _ qt(1-(alpha/2), df=sampsize-1)  #Calculates the t-statistic.

for (i in 1:numsamps){

	newsamp _ sample(incomes[,1], size=sampsize, replace=F)
	std.err _ sqrt(var(newsamp, unbiased=T)/sampsize)
	lower.bd[i] _ mean(newsamp) - (t.stat*std.err)
	upper.bd[i] _ mean(newsamp) + (t.stat*std.err)
	}

#How many covered the true mean?
covered _ rep(0, numsamps)
covered[lower.bd < mean(incomes[,1]) & upper.bd > mean(incomes[,1])] _ 1

plot(c(min(lower.bd), max(upper.bd)), c(1,numsamps), xlab="", ylab="", type="n")
for (i in 1:numsamps){
	if (covered[i] == 1) segments(lower.bd[i], i, upper.bd[i], i, col=2)
	else segments(lower.bd[i], i, upper.bd[i], i, col=3)
	}
abline(v=mean(incomes[,1]))
title(sub=paste(sum(covered)," of ",numsamps," intervals covered the true mean of ",mean(incomes[,1])))

Notice that confidence intervals that don't covered the true population mean are colored differently. How many confidence intervals failed to cover the mean?
In repeated simulations, is the performance of the confidence intervals generally consistent?

Don't forget to logout from your PC when you are done!