Lab: Central Limit Theorem

Purpose and summary of procedures

The purpose of today's lab is to use simulation to visualize the sampling distribution of the sample mean (as we have discussed in class). Recall that the Central Limit Theorem (CLT) tells us that as the sample size gets larger, the sampling distribution of the sample mean converges to a normal distribution. Therefore, when the sample size is large, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the distribution from which we are sampling (provided that distribution has a finite variance).

The procedure for today's lab is listed below.

1. Define the population. In our case, we begin with one "population" of 10000 incomes.
2. Define the population parameter of interest. As a simple example, you may be interested in mu, the mean income of people in the population. (Another possibility is p, the proportion of the population who make more than a certain amount.)
3. Determine what statistic/estimate you would use to estimate the population parameter. For mu, this would be x-bar, the sample mean. (For p, we would use p-hat, the sample proportion.)
4. Use S-Plus to take a random sample from the population of interest, then calculate the sample statistic for this sample. This is analogous to the "real-life" procedure of randomly selecting a sample of people, recording data about them, and calculating the mean of their incomes or the proportion of people meeting a certain criterion.
5. Repeat the previous step with many different samples, then make a histogram of the sample means to see what the sampling distribution looks like.
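
For readers who want to see the whole procedure in one place, here is a rough sketch in Python (not S-Plus, which the lab itself uses). It assumes a synthetic, right-skewed stand-in population rather than the lab's incomes data, and the names population, mu, sample_mean and sample_means are made up for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Define the population: 10000 synthetic, right-skewed "incomes".
# This is an assumption for illustration, NOT the lab's incomes.txt data.
population = [random.lognormvariate(10, 0.8) for _ in range(10000)]

# The parameter of interest is mu, the population mean. We can compute it
# here only because we hold the entire population.
mu = statistics.fmean(population)


def sample_mean(pop, n):
    """Draw n distinct members of pop at random and return their mean (x-bar)."""
    return statistics.fmean(random.sample(pop, n))


# Repeat the sampling step many times to approximate the sampling
# distribution of x-bar for samples of size 5.
sample_means = [sample_mean(population, 5) for _ in range(1000)]

print("population mean (mu):", round(mu, 2))
print("mean of 1000 sample means:", round(statistics.fmean(sample_means), 2))
```

A histogram of sample_means would play the same role as the histogram of sample means you build in S-Plus later in this lab.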

The Central Limit Theorem (CLT) states that as the sample size grows larger, the sampling distribution of the sample mean, x-bar, converges to normal. The sample mean is an unbiased estimator of the quantity we're interested in, mu, so the expected value of the sample mean is equal to the parameter we are trying to estimate. This means that the sampling distribution of the sample mean must be centered at the population parameter mu.

Starting S-Plus and getting the data

To start S-Plus, click Start, then Programs, then Statistics & Mathematics, then S-PLUS 2000. Click here to get the data for this lab exercise. Use the Netscape menus (File, then Save As) to save the data to the file "incomes.txt" on your desktop.

To read the data into S-Plus, choose File from the S-Plus menus, then Import Data, then From file. Choose the file that you have just saved. You should have a spreadsheet called "incomes" open with the data in one column, labeled "V1". Note: By default, S-Plus names the dataset after the file from which the data was imported. Your dataset will not be named "incomes" unless you named your data file "incomes" plus a file extension.

Surveying the population

Let's start by taking a look at the data, which will serve as a "population" for the purposes of this exercise. Imagine that this data consists of the total incomes of full-time workers in a company. To make a histogram of these incomes, choose Graph, then 2D Plot. From the list of graphs, choose "Histogram". In the window that appears, make sure your dataset's name is displayed, and choose the desired column to plot. Once the histogram has appeared, you can adjust various aspects of the plot by using the popup menu "Data to Plot" that appears if you right-click on the bars of the histogram.

Unlike the "real-life" case, we have the entire population here, and so we can obtain the population parameters (mean, etc.) using Statistics, Data summaries, then Summary statistics. It's important to know what the mean actually is so that you can see whether the sample means we look at later are in the right ballpark.
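
As a rough illustration of what such summary statistics contain, the Python sketch below (not S-Plus) computes a few of them for a synthetic population. The data are an assumption made for illustration and will not match the lab's incomes, and the variable names are hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Synthetic, right-skewed stand-in for the population of incomes
# (an assumption for illustration, NOT the lab's data).
population = [random.lognormvariate(10, 0.8) for _ in range(10000)]

mu = statistics.fmean(population)          # population mean
sigma = statistics.pstdev(population)      # population SD (divides by N, not N - 1)
median = statistics.median(population)     # middle value of the population

print("mean  :", round(mu, 2))
print("median:", round(median, 2))
print("SD    :", round(sigma, 2))
```

In this synthetic example the mean sits well above the median, which is one sign of a right-skewed distribution. Check your own histogram and summary to see whether the lab data behave the same way.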

How would you describe the shape of this population?
What is the mean income of this population?
What sample statistic/point estimate would you use to estimate the mean of this population if you were given a random sample from the population?

Simulated sampling (sample means)

Begin by taking a small sample of 5 people from the population. To generate this random sample, use the following command (in the "Commands" window) to randomly select five different members of the population given in the first column of the dataset. (This assumes a dataset named "incomes".)

sample(incomes[,1], replace=F, size=5)

What is the sample mean for your sample?
How far/close is it to the actual population mean?

Simulation with n=5

Now, we'd like to get an idea of what happens when we take multiple random samples of size 5. Use the following command to take a series of 1000 samples of size 5, find the corresponding sample means, and store the sample means in the second column.

for (i in 1:1000) incomes[i,2] _ mean(sample(incomes[,1], replace=F, size=5))

Now make a histogram of all the sample means. To calculate the mean and standard deviation of your sample means you can issue the commands

mean(incomes[,2], na.rm=T) 
sqrt(var(incomes[,2], unbiased=F, na.method="omit"))

Describe the shape of the histogram.
What is the center of the distribution of sample means?
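
To give a feel for what the center, spread and shape might look like, here is a rough Python sketch (not S-Plus) that repeats the n=5 simulation on a synthetic right-skewed population and prints a crude text histogram. The population and all variable names are assumptions for illustration, so your S-Plus results will differ.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Synthetic stand-in population (NOT the lab's incomes data).
population = [random.lognormvariate(10, 0.8) for _ in range(10000)]

# 1000 samples of size 5, each drawn without replacement.
means = [statistics.fmean(random.sample(population, 5)) for _ in range(1000)]

center = statistics.fmean(means)
spread = statistics.pstdev(means)  # divides by the count rather than count - 1
print("mean of sample means:", round(center, 2))
print("SD of sample means  :", round(spread, 2))

# Crude text histogram: 10 equal-width bins, one '#' per 10 sample means.
lo, hi = min(means), max(means)
width = (hi - lo) / 10
counts = [0] * 10
for m in means:
    k = min(int((m - lo) / width), 9)  # the largest mean goes in the last bin
    counts[k] += 1
for k, c in enumerate(counts):
    print(f"{lo + k * width:10.0f} | {'#' * (c // 10)}")
```

Each row shows the left edge of a bin followed by a bar, so a long tail of short bars toward the bottom indicates right skew.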

Simulation with n=15

Now let's try taking another 1000 samples, but with a sample size of 15. So, we alter the command to make the sample size larger and to place the resulting sample means in the third column of the "incomes" dataset.

for (i in 1:1000) incomes[i,3] _ mean(sample(incomes[,1], replace=F, size=15))

Describe the shape of the histogram of sample means (using sample size n=15).
What is the center of the distribution of sample means?
What differences did you note between the outcome using a sample size of 15 and the previous exercise with sample size 5?

Simulation with n=35

Finally, let's try the procedure with a sample size of 35.

for (i in 1:1000) incomes[i,4] _ mean(sample(incomes[,1], replace=F, size=35))

Describe the shape of the histogram of sample means (using n=35).
What is the center of the distribution of sample means?
As the sample size grew, did your results agree with the CLT? Compare the mean and standard deviation of your simulated sample means with the theoretical mean and standard deviation of the sampling distribution of x-bar, mu and sigma/sqrt(n), that we talked about in class.
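
The comparison with mu and sigma/sqrt(n) can be laid out as a small table. The Python sketch below (not S-Plus) does this for n = 5, 15 and 35 using a synthetic right-skewed population. The data and the names results, center, spread and theory are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

# Synthetic stand-in population (NOT the lab's incomes data).
population = [random.lognormvariate(10, 0.8) for _ in range(10000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

results = {}
for n in (5, 15, 35):
    # 1000 samples of size n, each drawn without replacement.
    means = [statistics.fmean(random.sample(population, n)) for _ in range(1000)]
    # Store the simulated center, the simulated spread, and sigma/sqrt(n).
    # The finite-population correction is ignored; it is negligible when
    # n is tiny compared with the population size of 10000.
    results[n] = (statistics.fmean(means), statistics.pstdev(means), sigma / math.sqrt(n))

print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")
for n, (center, spread, theory) in results.items():
    print(f"n={n:2d}  mean of x-bar={center:9.1f}  "
          f"SD of x-bar={spread:8.1f}  sigma/sqrt(n)={theory:8.1f}")
```

If the simulation behaves as the theory suggests, the mean of x-bar should stay near mu in every row, while the SD of x-bar should shrink roughly in step with sigma/sqrt(n).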

Don't forget to log out from your PC when you are done!