In recent lectures, a major focus has been the Central Limit Theorem (CLT). Remember, the CLT states that if the population is normal or if the sample size is large enough, the sampling distribution of the sample mean is approximately normal, with a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size.
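In symbols, the statement above can be written as follows. The notation is an assumption made here for illustration: μ and σ stand for the population mean and standard deviation, n for the sample size, and X̄ for the sample mean.

```latex
\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
```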
The purpose of this lab is "to see the CLT in action for yourself". How large does the sample size have to be in order for the sampling distribution to approach normality? How does this differ for various populations? We will use a computer program (written in C) to take some "samples" from a larger "population" and to calculate the corresponding sample means. Then, we will graph the results using SAS/INSIGHT.
In the first lab, you should have created a directory to store the data sets and other files related to this class. You probably called it "sta110b". To see if it's there, you can issue the command "ls" at the UNIX prompt to list all the files and subdirectories stemming from your home directory. As a reminder, it looks like this (with the prompt and command all together, and with your username and machine name substituted for mine):
jls11@teer17% ls
Among the names listed, you should see "sta110b". Now, change to that directory with the command
cd sta110b
Now, copy the compiled C program and the files containing the "populations" for this exercise from my public directory, using the commands
cp ~jls11/public/sample.exe sample.exe
cp ~jls11/public/normal_population normal_population
cp ~jls11/public/rtskew_population rtskew_population
cp ~jls11/public/bimodal_population bimodal_population
If you use the "ls" command to list all the files in your sta110b directory, you should see the files "sample.exe", "normal_population", etc. among the other things listed there.
Next, start SAS with the command
sas &
The SAS program will open about four windows (toolbox, program editor, log, output). You will probably want to "iconify" the output window, which we won't need in this exercise. To do this, click on the dot in the upper right-hand corner of the window's title bar. The window will be reduced to a small pictorial representation on the right-hand side of the screen. To get it back, double-click on it.
One way to get an idea of what is in the "population" files (you'd probably like to see the shapes of the population histograms before you use the sampling program) is to read them into SAS. This requires some code to be entered in the SAS program editor window. You can copy and paste this code from Netscape directly into the program editor. Highlight the text to be copied using the mouse's left button. Then, position the cursor at the spot in the program editor where you want the text to go. In this case, you want the text to begin 2 spaces to the right of 00001. Click the middle button to paste the text. (The numbers 00001, 00002, etc. are there primarily for mainframe SAS users; you can just ignore them.) Now we tell SAS in which SAS library (for us, this amounts to which directory) datasets can be placed.
libname sta110b '~/sta110b';
This tells SAS that it can store/read datasets from your "sta110b" directory (which is the directory into which you copied the "population" files). After entering this in the program editor, you need to submit this line to SAS. Either choose the running man icon in the toolbox window or choose submit under Locals in the program editor. The log window should contain a message to the effect that the "libref" (the library reference - "sta110b") was assigned successfully. If you don't see this message, an error has been made, and the following steps won't work correctly.
To read the data set into SAS, submit the following code. As you can see from the first line, the data set will be named populn and will be stored in the directory referred to by the libref sta110b (which is your newly created sta110b directory).
data sta110b.populn;
  infile '~/sta110b/normal_population';   /* reads data from file "normal_population" */
  input normal;                           /* one variable named "normal" in file "normal_population" */
  infile '~/sta110b/rtskew_population';   /* reads data from file "rtskew_population" */
  input rtskew;                           /* one variable named "rtskew" in file "rtskew_population" */
  infile '~/sta110b/bimodal_population';  /* reads data from file "bimodal_population" */
  input bimodal;                          /* one variable named "bimodal" in file "bimodal_population" */
run;
So, the dataset is called populn and is stored in the library sta110b (which we have set to the directory sta110b). It has three variables called normal, rtskew, and bimodal (representing our three populations of interest). The log window should tell us that there are 5000 observations and 3 variables.
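The data step above reads one value from each of the three files for every observation, so line i of each file ends up in row i of the dataset. As a rough illustration only (this is not part of the lab and not SAS), the same pairing can be sketched in Python, assuming each file holds one number per line. The function name `read_side_by_side` and the tiny temporary files are hypothetical stand-ins, not the lab's population files.

```python
import os
import tempfile

def read_side_by_side(paths):
    """Pair line i of every file into one row; stop at the shortest file."""
    files = [open(p) for p in paths]
    try:
        # zip() advances all files together, one line from each per row.
        return [tuple(float(line) for line in lines) for lines in zip(*files)]
    finally:
        for f in files:
            f.close()

# Hypothetical stand-in files with made-up values (not the lab's data).
tmpdir = tempfile.mkdtemp()
contents = {"a": "1.0\n2.0\n3.0\n", "b": "10.0\n20.0\n30.0\n"}
paths = []
for name, text in contents.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

rows = read_side_by_side(paths)
print(rows)
```

Each tuple in `rows` plays the role of one observation, with one column per input file.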
Go to the command line in the toolbox window and type insight, followed by return. In the next window, choose the dataset you created; the data window containing your data set should appear. Notice that the dataset is quite large; in the upper left of the "spreadsheet" we can reaffirm that there are 5000 rows and 3 columns.
Under the Analyze menu, choose the option Distribution. Click on each variable, and then click on the button labeled Y to add each to the list of variables to be graphed. (You can add all three at once if you hold down the control key when you click on each variable in turn.) All three variables will be treated as response variables. After you "OK" your selections, another window will appear with the histograms and summary statistics. From this we can see that each of the histograms has a different shape (explaining the variable names chosen for each). Whenever you want to, you can close the histogram window by selecting End under the File menu.
Go back to the xterm window where you issued the "ls" and "cp" commands before. (We're going to ignore SAS for a few minutes.) Before starting the sampling program, type (at the prompt)
limit coredumpsize 0
This command prevents the computer from automatically generating a file called "core" (a "core dump") if one or more runs of the sampling program should later exit with an error. These core files tend to be large and take a fair amount of disk space, and we won't be using them. To start the sampling program, type
./sample.exe
at the prompt. The program will prompt you for information such as the population size (which we remember was 5000) and the name of the file containing the distribution you're interested in (start with "normal_population"); you can choose the sample size, number of samples, and the output file name yourself. (If you want, you could start with samples of size 20, 500 samples, and output filename "normsamp1".) The output file will contain the sample means for the specified number of random samples. To exit the program at any time, type Ctrl-c (type "c" while holding down the control key). Enter each piece of information carefully, following each answer with a simple press of the "return" key. Since this is not a commercial program and was not written by a professional C programmer (it was written by your stats instructor), it doesn't have error checking. If you see an error like "Segmentation violation", that means that you probably made an error in entering information. Just start the program again.
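For intuition about what the sampling program produces, here is a minimal Python sketch of the same idea: draw repeated random samples from a population and record each sample's mean. It is not the instructor's C program. The function name `sample_means`, the stand-in population, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=None):
    """Return the mean of each of num_samples random samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # rng.sample draws without replacement; the real program may differ.
        sample = rng.sample(population, sample_size)
        means.append(statistics.fmean(sample))
    return means

# Hypothetical stand-in population of 5000 values (not the lab's data).
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(5000)]

# Same settings suggested in the text: samples of size 20, 500 samples.
means = sample_means(population, sample_size=20, num_samples=500, seed=1)
print(len(means))
```

The list `means` corresponds to the contents of one output file such as "normsamp1": one sample mean per sample.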
When the program is done, the prompt will reappear. If you issue the "ls" command again, you should see the name of the file you chose as the output file name. Run the program a couple of times with different sample sizes (remember to save the results to different output files), and then use the following code to compare them using SAS.
data sta110b.normsamp;                  /* change "normsamp" to another name when dealing with bimodal or right-skewed distributions (to avoid overwriting "normsamp") */
  infile '~/sta110b/your_outputfile1';  /* first output file to read */
  input samp1;                          /* name of variable in first output file */
  infile '~/sta110b/your_outputfile2';  /* second output file to read */
  input samp2;                          /* name of variable in second output file */
  ...                                   /* add more outputfile results here as necessary (according to the pattern) */
run;
Now, go to your already open SAS/INSIGHT "spreadsheet" window. Choose Open under File to open this new dataset (now you have one spreadsheet up for each dataset). Now choose Distribution under Analyze. Add all of the variables to the "Y" list, and then click "OK". This should produce summary statistics, as well as a histogram, for each variable. You can adjust the tick marks (as in lab 1) to make the scales more comparable. Also, check that the mean and standard deviation of the sample means are in accordance with the Central Limit Theorem. The information you need to do this should be displayed in the summary statistics that you generated with the histograms.
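The check described above amounts to comparing two pairs of numbers: the mean of the sample means against the population mean, and the standard deviation of the sample means against the population standard deviation divided by the square root of the sample size. A minimal Python sketch of that comparison follows. The function name `clt_check`, the simulated right-skewed population, and sampling with replacement are assumptions for illustration; in the lab you would read these values off the SAS/INSIGHT summary statistics instead.

```python
import math
import random
import statistics

def clt_check(population, sample_means, sample_size):
    """Compare observed sample-mean statistics with the CLT predictions."""
    mu = statistics.fmean(population)
    sigma = statistics.pstdev(population)  # population standard deviation
    return {
        "expected_mean": mu,
        "observed_mean": statistics.fmean(sample_means),
        "expected_sd": sigma / math.sqrt(sample_size),
        "observed_sd": statistics.stdev(sample_means),
    }

# Hypothetical right-skewed population of 5000 values (not the lab's data).
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(5000)]

# 1000 samples of size 25, drawn with replacement (an assumption).
n = 25
means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(1000)]

result = clt_check(population, means, n)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

If the expected and observed values are close, the sample means are behaving as the CLT predicts for that sample size.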
While the cursor is on the background (not in the xterm windows, Netscape, or the windows corresponding to any other program), click the left button. From the menu that appears, select Logout.
To preserve the security of your account (including your files and password), you must log out!