STA110B: Computer lab for week 5

Checking out the Central Limit Theorem on your own

In recent lectures, a major focus has been the Central Limit Theorem (CLT). Remember, the CLT states that if the population is normal or if the sample size is large enough, the sampling distribution of the sample mean is approximately normal, with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size.
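To make the formula concrete, here is a minimal Python sketch (Python is not part of this lab; it is used only for illustration). The function name clt_prediction and the population values (mean 50, standard deviation 10) are made up for the example.

```python
import math

def clt_prediction(pop_mean, pop_sd, n):
    """Return the (mean, standard deviation) the CLT predicts for sample means of size n."""
    return pop_mean, pop_sd / math.sqrt(n)

# Hypothetical population: mean 50, standard deviation 10.
for n in (5, 20, 100):
    mean, sd = clt_prediction(50.0, 10.0, n)
    print(f"n = {n:3d}: mean of sample means = {mean:.1f}, SD of sample means = {sd:.3f}")
```

Notice that the predicted mean never changes, while the predicted standard deviation shrinks as the sample size grows.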

The purpose of this lab is "to see the CLT in action for yourself". How large does the sample size have to be in order for the sampling distribution to approach normality? How does this differ for various populations? We will use a computer program (written in C) to take some "samples" from a larger "population" and to calculate the corresponding sample means. Then, we will graph the results using SAS/INSIGHT.

Preliminaries

In the first lab, you should have created a directory to store the data sets and other files related to this class. You probably called it "sta110b". To see if it's there, you can issue the command "ls" at the UNIX prompt to list all the files and subdirectories stemming from your home directory. As a reminder, it looks like this (with the prompt and command all together, and with your username and machine name substituted for mine):

jls11@teer17% ls

Among the names listed, you should see "sta110b". Now, change to that directory with the command

cd sta110b

Now, you want to copy the compiled C program and the files containing the "populations" for this exercise from my public directory, using the commands

cp ~jls11/public/sample.exe sample.exe
cp ~jls11/public/normal_population normal_population
cp ~jls11/public/rtskew_population rtskew_population
cp ~jls11/public/bimodal_population bimodal_population

If you use the "ls" command to list all the files in your sta110b directory, you should see the files "sample.exe", "normal_population", etc. among the other things listed there.

Start SAS

At the UNIX prompt, type

sas &

The SAS program will open about four windows (toolbox, program editor, log, output). You will probably want to "iconify" the output window, which we won't need in this exercise. To do this, click on the dot in the upper right-hand corner of the window's title bar. The window will be reduced to a small pictorial representation on the right-hand side of the screen. To get it back, double-click on it.

Read the "population" files into SAS

The "population" files each contain 5000 numbers, randomly sampled from a distribution. We can treat a list of numbers as a "population"; we propose to take a series of samples (of some size that you will determine) to estimate the population mean. We can use SAS/INSIGHT to calculate this mean and then see how close our sample means are to the actual mean. After using the "sample.exe" file that you copied earlier to take a series of random samples, you will be able to see that the histogram of all the sample means does behave as the Central Limit Theorem predicts.

One way to get an idea of what is in the "population" files (you'd probably like to see the shapes of the population histograms before you use the sampling program) is to read them into SAS. This requires some code to be entered in the SAS program editor window. You can copy and paste this code from Netscape directly into the program editor. Highlight the text to be copied using the mouse's left button. Then, position the cursor at the spot in the program editor where you want the text to go. In this case, you want the text to begin 2 spaces to the right of 00001. Click the middle button to paste the text. (The numbers 00001, 00002, etc. are there primarily for mainframe SAS users; you can just ignore them.) Now we tell SAS in which SAS library (for us, this amounts to which directory) datasets can be placed.

 libname sta110b '~/sta110b';

This tells SAS that it can store/read datasets from your "sta110b" directory (which is the directory into which you copied the "population" files). After entering this in the program editor, you need to submit this line to SAS. Either choose the running man icon in the toolbox window or choose submit under Locals in the program editor. The log window should contain a message to the effect that the "libref" (the library reference - "sta110b") was assigned successfully. If you don't see this message, an error has been made, and the following steps won't work correctly.

To read the data set into SAS, submit the following code. As you can see from the first line, the data set will be named populn and will be stored in the directory referred to by the libref sta110b (which is your sta110b directory).

data sta110b.populn;
  infile '~/sta110b/normal_population';  /* reads data from file "normal_population" */
  input normal;                          /* one variable named "normal" in file "normal_population" */
  infile '~/sta110b/rtskew_population';  /* reads data from file "rtskew_population" */
  input rtskew;                          /* one variable named "rtskew" in file "rtskew_population" */
  infile '~/sta110b/bimodal_population'; /* reads data from file "bimodal_population" */
  input bimodal;                         /* one variable named "bimodal" in file "bimodal_population" */
run;

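So, the dataset is called populn and is stored in the library sta110b (which we have set to the directory sta110b). Each pass through the data step reads one line from each of the three files, so each row of the dataset holds one value from each population. The dataset therefore has three variables called normal, rtskew, and bimodal (representing our three populations of interest). The log window should tell us that there are 5000 observations and 3 variables.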

Start INSIGHT

Go to the command line in the toolbox window and type insight, followed by return. In the next window, choose the dataset you created; the data window containing your data set should appear. Notice that the dataset is quite large; in the upper left of the "spreadsheet" we can confirm that there are 5000 rows and 3 columns.

Create a histogram of all the data

Under the Analyze menu, choose the option Distribution. Click on each variable, and then click on the button labeled Y to add each to the list of variables to be graphed. (You can add all three at once if you hold down the control key when you click on each variable in turn.) All three variables will be treated as response variables. After you "OK" your selections, another window will appear with the histograms and summary statistics. From this we can see that each of the histograms has a different shape (explaining the variable names chosen for each). Whenever you want to, you can close the histogram window by selecting End under the File menu.

Use the C program to generate random samples from each distribution

Go back to the xterm window where you issued the "ls" and "cp" commands before. (We're going to ignore SAS for a few minutes). Before starting the sampling program, type (at the prompt)

limit coredumpsize 0

This command prevents the computer from automatically generating a file called "core" if a run of the sampling program later exits with an error (thus the name "core dump"). These core files tend to be large and take a fair amount of disk space, and we won't be using them. To start the sampling program, type

./sample.exe

at the prompt. The program will ask you for information such as the population size (which we remember was 5000) and the name of the file containing the distribution you're interested in (start with "normal_population"); you can choose the sample size, number of samples, and the output file name yourself. (If you want, you could start with samples of size 20, 500 samples, and output filename "normsamp1".) The output file will contain the sample means for the specified number of random samples. To exit the program at any time, type Ctrl-c (type "c" while holding down the control key). Enter each piece of information carefully, following each answer with a single press of the "return" key. Since this is not a commercial program and was not written by a professional C programmer (it was written by your stats instructor), it doesn't have error checking. If you see an error like "Segmentation violation", you probably made a mistake in entering information. Just start the program again.

When the program is done, the prompt will reappear. If you issue the "ls" command again, you should see the name of the file you chose as the output file name. Run the program a couple of times with different sample sizes (remember to save the results to different output files), and then use the following instructions to compare them using SAS.
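As an aside, if you are curious about what a sampling program of this kind does, the idea can be sketched in a few lines of Python. This is not the source of "sample.exe"; the function name sample_means, the stand-in population, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random

def sample_means(population, sample_size, num_samples, seed=None):
    """Draw num_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # Assumption: each sample is drawn without replacement from the population.
        sample = rng.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return means

# Stand-in population of 5000 normal values (hypothetical; the lab reads a file instead).
rng = random.Random(110)
population = [rng.gauss(50.0, 10.0) for _ in range(5000)]

# Same settings as the suggested first run: samples of size 20, 500 samples.
means = sample_means(population, sample_size=20, num_samples=500, seed=5)
print(f"{len(means)} sample means, from {min(means):.2f} to {max(means):.2f}")
```

Each entry of means corresponds to one line of the output file that the sampling program writes.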

Read in the sample means as a SAS dataset

It's probably easiest to look first at the normal distribution and the several sets of sample means that you generated using it. (Later, you can use these same instructions for dealing with the bimodal distribution and the sample means that you generated using it, etc.) To combine this sampling information into one dataset, submit the following code in the program editor window, replacing your_outputfile1, your_outputfile2, etc. with the output file names you chose:

data sta110b.normsamp; /* change "normsamp" to another name when dealing with the bimodal or right-skewed distributions (to avoid overwriting "normsamp") */
  infile '~/sta110b/your_outputfile1'; /* first output file to read */
  input samp1;                         /* name of variable in first output file */
  infile '~/sta110b/your_outputfile2'; /* second output file to read */
  input samp2;                         /* name of variable in second output file */

  ...  /* add more outputfile results here as necessary (according to the pattern) */

run;

Now, go to your already open SAS/INSIGHT "spreadsheet" window. Choose Open under File to open this new dataset (now you have one spreadsheet up for each dataset). Now choose Distribution under Analyze. Select all variables in the "Y" column, and then "OK". This should produce summary statistics, as well as a histogram, for each variable. You can adjust the tick marks (as in lab 1) to make the scales more comparable. Also, check that the mean and standard deviation of each set of sample means agree with the Central Limit Theorem. The information you need to do this should be displayed in the summary statistics that you generated with the histograms.
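The check against the Central Limit Theorem amounts to a short calculation: the mean of the sample means should be close to the population mean, and their standard deviation should be close to the population standard deviation divided by the square root of the sample size. The following Python sketch is only an illustration, not part of the SAS workflow; the function name compare_to_clt and the right-skewed stand-in population are assumptions.

```python
import math
import random
import statistics

def compare_to_clt(population, sample_means, sample_size):
    """Put the CLT's predictions next to what the sample means actually show."""
    return {
        "predicted_mean": statistics.fmean(population),
        "observed_mean": statistics.fmean(sample_means),
        "predicted_sd": statistics.pstdev(population) / math.sqrt(sample_size),
        "observed_sd": statistics.stdev(sample_means),
    }

# Hypothetical right-skewed population of 5000 values (not the lab's file).
rng = random.Random(110)
population = [rng.expovariate(1 / 10.0) for _ in range(5000)]

# 500 samples of size 20, drawn without replacement (an assumption); with 20
# values out of 5000 the finite-population correction is negligible.
sample_size = 20
sample_means = [statistics.fmean(rng.sample(population, sample_size))
                for _ in range(500)]

report = compare_to_clt(population, sample_means, sample_size)
for name, value in report.items():
    print(f"{name:>14}: {value:.3f}")
```

In the lab itself, the same four numbers come from the summary statistics for the population variable and for each samp variable.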

Superimposing the normal curve on a histogram

In some cases, it may be desirable to superimpose the normal curve on a histogram, to see how well it fits (or doesn't fit) the data. To do this (after generating the histogram using the "Distribution" menu choices), choose Parametric density under Curves. Make sure "Normal" is checked; the other defaults are fine. Click "OK", and notice the red curves drawn over the largely green histogram(s).
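The comparison the curve makes by eye can also be done with numbers: for each histogram bar, compare the observed count with the count a normal distribution with the same mean and standard deviation would put in that bin. The Python sketch below illustrates the idea; it is not how SAS/INSIGHT draws its curve, and the function name histogram_vs_normal, the bin count, and the stand-in data are assumptions.

```python
import random
import statistics

def histogram_vs_normal(values, num_bins=10):
    """Return (bin_left, bin_right, observed_count, expected_count) rows."""
    fitted = statistics.NormalDist(statistics.fmean(values), statistics.stdev(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    rows = []
    for i in range(num_bins):
        left = lo + i * width
        right = lo + (i + 1) * width
        # Count values in [left, right); the last bin also includes the maximum.
        if i == num_bins - 1:
            observed = sum(1 for v in values if left <= v <= hi)
        else:
            observed = sum(1 for v in values if left <= v < right)
        # Count expected in this bin under the fitted normal distribution.
        expected = len(values) * (fitted.cdf(right) - fitted.cdf(left))
        rows.append((left, right, observed, expected))
    return rows

# Stand-in for 500 sample means (hypothetical values, not lab output).
rng = random.Random(110)
values = [rng.gauss(50.0, 2.2) for _ in range(500)]

rows = histogram_vs_normal(values, num_bins=8)
for left, right, observed, expected in rows:
    print(f"{left:6.2f} to {right:6.2f}: observed {observed:3d}, expected {expected:6.1f}")
```

Bins where the observed and expected counts differ widely are the places where the red curve would sit visibly above or below the bars.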

Running the program with your own distribution

If you have the time or inclination, you can manufacture your own distribution to use with the "sample.exe" program. Your distribution can just consist of a series of measurements, each on its own line (just hit return between data points). To do this, you need to use the text editor "pico". Just type pico at the UNIX prompt, and the window will become a small (primitive) text editor. The commands you need are all listed at the bottom. (Note that ^x means "type x while holding down the control key".) Once you have created and saved your "new distribution", you can start the "sample.exe" program and answer the questions with your own newly created "population" in mind.
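If typing thousands of values by hand sounds tedious, a population file in the same one-value-per-line layout can also be generated by a short script. The Python sketch below is one way to do it, offered as an illustration only; the function name write_population, the file name "my_population", and the uniform distribution are all assumptions. It writes to the system's temporary directory so that nothing in your sta110b directory is overwritten; change the path if you want to use the file with "sample.exe".

```python
import os
import random
import tempfile

def write_population(values, path):
    """Write one measurement per line, the layout described above."""
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v:.4f}\n")

# Hypothetical population: 5000 values spread evenly between 0 and 100.
rng = random.Random(2024)
values = [rng.uniform(0.0, 100.0) for _ in range(5000)]

path = os.path.join(tempfile.gettempdir(), "my_population")
write_population(values, path)

# Read the file back to confirm its layout.
with open(path) as f:
    lines = f.read().splitlines()
print(f"wrote {len(lines)} lines to {path}")
```

Remember that the sampling program asks for the population size, so note how many values you wrote.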

To end the SAS program

To end SAS, type bye in the toolbox's command field, or choose Exit under the File menu in the log, program editor, or output window. Your datasets will be saved in your sta110b directory.

To log out from ACPUB

While the cursor is on the background (not in the xterm windows, Netscape, or the windows corresponding to any other program), click the left button. From the menu that appears, select Logout.

To preserve the security of your account (including your files and password), you must log out!