This lab shows how to generate random numbers and use them to do probability calculations. For many of our calculations we will need the sample() function, which works like this:

sample ( x, size=, replace=, prob= )

You must specify x. It is the population to be sampled. You may also specify the sample size, whether to sample with or without replacement, and whether to sample with unequal probabilities. These variations should become clear through the examples below.

Begin with a simple example, rolling a six-sided die, to clarify ideas. Rolling a die is like choosing a number from 1 to 6, so x should be the numbers 1, 2, 3, 4, 5 and 6. Let's roll the die 400 times. Rolling the die is like sampling with replacement. So we say

die _ sample ( 1:6, 400, replace=T )

The result should be 400 numbers, each one between 1 and 6. Type die to see them. In the long run we expect about 1/6 of the rolls to be "1", 1/6 to be "2", and so on. In 400 rolls that's 400/6 = 66.66.... To see how our rolls came out we can use the following loop, which asks Splus to tell us how many 1's there are, how many 2's, etc.

for ( i in 1:6 )
print ( sum ( die==i ))

The meaning of the loop is:
Set i equal to 1
Print the total of all the rolls equal to i
Repeat for i=2, i=3, ..., i=6

Of course we didn't get exactly 66.66... 1's, 2's, 3's, etc. One thing we'll study later in the course is how far off 66.66... we expect to be, typically, in 400 rolls of a die.

Now let's use the sample() function to answer questions 7-10 from Quiz 4. Understanding these problems in S-Plus will give you another way to visualize the formula for conditional chances from last Friday's class. Questions 7-10 talk about families of three children and how many of them are girls. Let's generate a sample of 10,000 families, keeping track of the sex of the children.

Child1 _ sample ( c("B","G"), 10000, replace=T )
Child2 _ sample ( c("B","G"), 10000, replace=T )
Child3 _ sample ( c("B","G"), 10000, replace=T )

To see what you've just done, use commands like
Child1[1:50]
which shows the sex of the oldest child in the first 50 families.

Question 7 asks for the conditional chance that the second child is a girl given that at least one child is a girl. To answer this question we want to select all the families in which at least one child is a girl and, of those, find out in how many the middle child is a girl. That is, we want the fraction
(# with middle=G and at least one G)/(# with at least one G).
The next two commands pick out the families with at least one girl and add them up. Recall that "==" tests whether two things are equal; "|" is S-Plus for "or" and "&" is S-Plus for "and".

at.least.1.girl _ Child1 == "G" | Child2 == "G" | Child3 == "G"
sum ( at.least.1.girl )

The next commands pick out the families in which the middle child is a girl, add up the families who satisfy both conditions, and calculate the ratio.

middle.is.girl _ Child2 == "G"
sum ( middle.is.girl & at.least.1.girl ) / sum ( at.least.1.girl)

How does that compare to my answer of 4/7? Just type 4/7 in Splus and compare. My simulation result was 0.570056, while 4/7 is 0.5714286. That's pretty close.

Now let's answer questions 8-10. See whether you can figure out what the Splus commands do.

first.two.are.girls _ Child1 == "G" & Child2 == "G"
sum ( first.two.are.girls & at.least.1.girl ) / sum ( at.least.1.girl)

at.least.2.girls _ (Child1=="G"&Child2=="G") | (Child1=="G"&Child3=="G") | (Child2=="G"&Child3=="G")
sum(at.least.2.girls)/10000

sum(at.least.2.girls & middle.is.girl)/sum(at.least.2.girls)

How do those compare with my answers of 2/7, 1/2 and 3/4?

Let's try question 5 from the quiz. In two rolls of a die, what is the chance of at least one 2?

Die1 _ sample ( 1:6, 10000, replace=T )
Die2 _ sample ( 1:6, 10000, replace=T )
sum ( Die1==2 | Die2==2 )/10000
1/6 + 1/6 - 1/36
# my answer

Question 4 is about cards. There are 52 cards in a deck. Let's number them 1-52, where cards 1-13 are the ace, 2, 3, ..., king of clubs; cards 14-26 are the ace, 2, ..., king of diamonds; cards 27-39 are the hearts and 40-52 are the spades. There are two new things we'll have to worry about. The first is that when we deal two cards, that's like sampling without replacement. The second is how to look for an ace or a king, or a queen, etc. One way to look for an ace is to see whether the card is either card 1 or 14, or 27 or 40. Another way is to divide the number by 13 and see whether the remainder is 1. In Splus %% gives remainders. For example:
14 %% 13 = 1
15 %% 13 = 2
29 %% 13 = 3
10 %% 13 = 10.
Here are some commands to deal 2 cards, 10000 times, and find how many had first and ace, then a 10, jack, queen or king. Don't just copy them; try to figure out what they do. Ask for help if you can't figure it out.

bj _ 0
for ( i in 1:10000 ) {
Cards _ sample ( 1:52, 2, replace=F )
if ( Cards[1]%%13==1 &
(Cards[2]%%13==0 | Cards[2]%%13==10 | Cards[2]%%13==11 |
Cards[2]%%13==12) )
bj _ bj+1
}
bj/10000

Be patient. S-Plus is very slow with loops.
Of course a blackjack can happen with either the ace first or the ace second. So the answer we want is twice what we've just calculated, or

2*bj/10000

For a final example, let's return to the Slater School example we discussed in class. Recall that there were 145 women employees. 8 of them developed the particular cancer in question. A national data base says that about 3% of women nationally develop this cancer. Teachers at Slater suspected that their cancer rate was increased by the nearby high tension lines. A relevant calculation is the chance that 8 or more women out of 145 would develop the cancer just by luck of the draw. To do the calculation we will simulate 1000 Slater schools, each having 145 employees.

Imagine a box with 100 tickets in it. Three tickets have the number "1"; 97 tickets have the number "0". Drawing a "1" is like getting the cancer. We will draw a ticket out of the box for each employee. For each set of 145 employees we will count how many get a "1".

To keep track of everything, we will put the employees into a big matrix. Each row of the matrix will represent a group of 145 employees. There will be 1000 rows. We will need to draw 145*1000 tickets from the box, with replacement. Note the use of sample( ..., prob=c(.97,.03)) to draw with unequal probabilities and the matrix ( ..., 1000, 145 ) to put the draws into a 1000 by 145 matrix.

slater _ matrix ( sample(0:1, 145000, replace=T, prob=c(.97,.03) ), 1000, 145 )
rowsums _ rep(NA,1000)
# create 100 missing values that will eventually hold the row sums
for ( i in 1:1000 )
rowsums[i] _ sum ( slater[i,] )
sum ( rowsums > 7 ) / 1000

That last calculation gives the chance that more than 7 women out of 145 develop the cancer, at least according to our simulation. For me, the answer was sum ( rowsums > 7 ) / 1000 = 0.077. I also did an exact calculation based on theory instead of simulation. I got 0.07170263. We will see how to do the exact calculation in the next chapter of FPP.