Lab: Central Limit Theorem
Purpose and summary of procedures
The purpose of today's lab is to use simulation to visualize the sampling
distribution of the sample mean, as discussed in class. Recall that the
Central Limit Theorem (CLT) tells us that as the sample size gets larger,
the sampling distribution of the sample mean converges to a normal
distribution. Therefore, when the sample size is large, the sampling
distribution of the sample mean is approximately normal, regardless of the
shape of the distribution from which we are sampling (provided that
distribution has a finite variance).
The procedure for today's lab is listed below.
- Define the population. In our case, we begin with one "population" of
  10000 incomes.
- Define the population parameter of interest. As a simple example, you may
  be interested in mu, the mean income of people in the population. (Another
  possible parameter of interest is p, the proportion of the population who
  make more than a certain amount.)
- Determine what statistic/estimate you would use to estimate the population
  parameter. For the first example above, this would be x-bar, the sample
  mean. (In the latter case, we would use p-hat, the sample proportion.)
- Use S-Plus to take a random sample from the population of interest, then
  calculate the sample statistic for this sample. This is analogous to the
  "real-life" procedure of randomly selecting a sample of people, recording
  data about them, and calculating the mean of their heights or the
  proportion of people meeting a certain criterion.
- After following the procedure above with many different samples, we can
  make a histogram of the sample means to see what the sampling distribution
  looks like.
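The lab itself works through the sample mean, so here is a minimal sketch of
the same procedure for the other example mentioned above, the sample
proportion p-hat. It is written in S syntax that should run in S-Plus and R.
The population "pop" is a made-up stand-in (in the lab you would use
incomes[,1] instead), and the cutoff of 30000 and the sample size of 35 are
arbitrary illustrative choices, not values from the lab.

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

cutoff <- 30000                          # hypothetical income threshold
p <- sum(pop > cutoff) / length(pop)     # population parameter p

# One random sample of 35 people and its sample proportion, p-hat
one.sample <- sample(pop, size = 35, replace = F)
p.hat <- sum(one.sample > cutoff) / length(one.sample)

# Repeat for 1000 samples to approximate the sampling distribution of p-hat
p.hats <- numeric(1000)
for (i in 1:1000) {
  s <- sample(pop, size = 35, replace = F)
  p.hats[i] <- sum(s > cutoff) / length(s)
}
hist(p.hats)
```

The histogram of p.hats plays the same role for p that the histogram of
sample means plays for mu.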
The Central Limit Theorem (CLT) states that as the sample size grows
larger, the sampling distribution of the sample mean, x-bar, converges
to normal. The sample mean is also an unbiased estimator of mu, the quantity
we're interested in, so the expected value of the sample mean is
equal to the parameter which we are trying to estimate. This means that
the sampling distribution of the sample mean is centered at the
population parameter mu.
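In symbols, for a random sample of size n from a population with mean mu and
standard deviation sigma, these facts, together with the standard deviation
sigma/sqrt(n) discussed in class, can be summarized as below. The formula
for the standard deviation treats the draws as independent. That is an
approximation here, since the lab samples without replacement, but with
samples of at most 35 out of 10000 the difference is negligible.

```latex
\[
E(\bar{x}) = \mu, \qquad
\mathrm{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \qquad
\bar{x} \;\overset{\text{approx.}}{\sim}\;
N\!\left(\mu, \frac{\sigma^{2}}{n}\right) \quad \text{for large } n .
\]
```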
Starting S-Plus and getting the data
To start S-Plus, click Start, then Programs, then Statistics & Mathematics,
then S-PLUS 2000. Click here to get the data for this lab exercise. Use the
Netscape menus (File, then Save As) to save the data to the file
"incomes.txt" on your desktop.
To read the data into S-Plus, choose File, then Import Data, then From file
from the S-Plus menus. Choose the file that you have just saved. You should
have a spreadsheet called "incomes" open with the data in one column,
labeled "V1". Note: By default, S-Plus will name the dataset according to
the name of the file from which the data was imported. Your dataset will not
be named "incomes" unless you named your data file "incomes" plus file
extension.
Surveying the populations
Let's start by taking a look at the data, which will serve as a "population"
for the purposes of this exercise. Imagine that the data consist of the
total incomes of full-time workers in a company. To make a histogram of
these incomes, choose Graph, then 2D Plot. From the list
of graphs, choose "Histogram". In the window that appears make sure your
data set's name is displayed, and choose the desired column to plot. Once
the histogram has appeared, you can adjust various aspects of the plot
by using the popup menu "Data to Plot" that appears if you right-click
on the bars of the histogram.
Unlike the "real-life" case, we have the entire population here, and
so we can obtain the population parameters (mean, etc.) using Statistics,
then Data summaries, then Summary statistics. It's important to
know what the mean actually is so that you can see whether the sample means
we look at later are in the right ballpark.
- How would you describe the shape of this population?
- What is the mean income of this population?
- What sample statistic/point estimate would you use to estimate the mean
  of this population if you were given a random sample from the population?
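The population summaries described in this section can also be obtained
from the Commands window. The following is a minimal sketch in S syntax that
should run in S-Plus and R. It uses a made-up, right-skewed stand-in called
"pop" so that it runs on its own; the real data may look different, and in
the lab you would set pop to incomes[,1].

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

mu <- mean(pop)                      # population mean
sigma <- sqrt(mean((pop - mu)^2))    # population sd (divides by N, not N-1)
med <- median(pop)                   # compare with mu to judge skewness

hist(pop)                            # shape of the population
```

Because the whole population is available, sigma is computed with divisor N
rather than N-1. Keep mu and sigma on hand for the comparisons later in the
lab.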
Simulated sampling (sample means)
Begin by taking a small sample of 5 people from the population. To generate
this random sample, use the following command (in the "Commands" window)
to randomly select five different members of the population given in the
first column of the dataset. (This assumes a dataset named "incomes".)
sample(incomes[,1], replace=F, size=5)
- What is the sample mean for your sample?
- How close is it to the actual population mean?
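To answer these two questions from the Commands window, one option is to
save the sample before summarizing it. This is a sketch only: "pop" is a
made-up stand-in for incomes[,1], and the names my.sample and x.bar are
illustrative.

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

# Draw five different members of the population and keep them
my.sample <- sample(pop, size = 5, replace = F)

x.bar <- mean(my.sample)    # the sample mean
x.bar - mean(pop)           # how far the sample mean is from mu
```

With only five observations, the difference in the last line can be large,
and it changes every time a new sample is drawn.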
Simulation with n=5
Now, we'd like to get an idea of what happens when we take multiple random
samples of size 5. Use the following command to take a series of 1000 samples
of size 5, find the corresponding sample means, and store the sample means
in the second column of the "incomes" dataset.
for (i in 1:1000) incomes[i,2] _ mean(sample(incomes[,1], replace=F, size=5))
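Now make a histogram of all the sample means. To calculate the mean and
standard deviation of your sample means, you can issue the commands below.
The na.rm and na.method arguments tell S-Plus to skip any rows of the column
that hold missing values.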
mean(incomes[,2], na.rm=T)
sqrt(var(incomes[,2], unbiased=F, na.method="omit"))
- Describe the shape of the histogram.
- What is the center of the distribution of sample means?
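As an alternative to storing the sample means in a column of the dataset,
they can be kept in a separate vector, so there are no unfilled rows to
skip when summarizing. The following is a sketch under that assumption, with
a made-up stand-in "pop" in place of incomes[,1] and an illustrative vector
name xbar5.

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

# 1000 sample means, each from a sample of size 5
xbar5 <- numeric(1000)
for (i in 1:1000) xbar5[i] <- mean(sample(pop, size = 5, replace = F))

hist(xbar5)         # shape of the sampling distribution
mean(xbar5)         # center of the sample means
sqrt(var(xbar5))    # spread of the sample means
```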
Simulation with n=15
Now let's try taking another 1000 samples, but with a sample size of 15.
So, we alter the command to make the sample size larger and to place the
resulting sample means in the third column of the "incomes" dataset.
for (i in 1:1000) incomes[i,3] _ mean(sample(incomes[,1], replace=F, size=15))
- Describe the shape of the histogram of sample means (using sample size
  n=15).
- What is the center of the distribution of sample means?
- What differences did you note between the outcome using a sample size of
  15 and the previous exercise with sample size 5?
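One difference worth quantifying is the spread. If the standard deviation of
the sample mean is sigma/sqrt(n), then going from n=5 to n=15 should shrink
it by a factor of about sqrt(15/5), roughly 1.73. The sketch below checks
this on a made-up stand-in population "pop" (use incomes[,1] in the lab);
the helper sim.means is hypothetical and not part of S-Plus.

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

# Hypothetical helper: means of `reps` random samples of size n
sim.means <- function(pop, n, reps = 1000) {
  out <- numeric(reps)
  for (i in 1:reps) out[i] <- mean(sample(pop, size = n, replace = F))
  out
}

sd5 <- sqrt(var(sim.means(pop, 5)))
sd15 <- sqrt(var(sim.means(pop, 15)))

sd5 / sd15     # simulated ratio of spreads
sqrt(15 / 5)   # ratio suggested by sigma/sqrt(n)
```

The two numbers will not match exactly, since the simulated ratio is based
on only 1000 samples per sample size.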
Simulation with n=35
Finally, let's try the procedure with a sample size of 35.
for (i in 1:1000) incomes[i,4] _ mean(sample(incomes[,1], replace=F, size=35))
- Describe the shape of the histogram of sample means (using n=35).
- What is the center of the distribution of sample means?
- As the sample size grew, did your results agree with the CLT? Compare the
  mean and standard deviation of your simulated sample means with the
  theoretical values mu and sigma/sqrt(n) that we talked about in class.
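The comparison in the last question can be laid out as a small table, with
one row per sample size. This is a sketch only, run on a made-up stand-in
population "pop" rather than the lab data (use incomes[,1] in the lab); the
column names are illustrative.

```r
# Hypothetical stand-in for the population of 10000 incomes.
# In the lab itself, use:  pop <- incomes[,1]
set.seed(1)
pop <- exp(rnorm(10000, mean = 10, sd = 0.8))

mu <- mean(pop)                      # theoretical center of x-bar
sigma <- sqrt(mean((pop - mu)^2))    # population standard deviation

sizes <- c(5, 15, 35)
results <- matrix(0, nrow = 3, ncol = 4)
for (j in 1:3) {
  xbars <- numeric(1000)
  for (i in 1:1000) xbars[i] <- mean(sample(pop, size = sizes[j], replace = F))
  # sample size, simulated mean, simulated sd, theoretical sd
  results[j, ] <- c(sizes[j], mean(xbars), sqrt(var(xbars)),
                    sigma / sqrt(sizes[j]))
}
dimnames(results) <- list(NULL, c("n", "mean.xbar", "sd.xbar", "sigma/sqrt(n)"))
results
```

In each row, mean.xbar should be close to mu, and sd.xbar should be close to
the last column, with both agreeing only up to simulation error.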
Don't forget to log out from your PC when you are done!