MTH 135 / STA 104: Probability                                        Week 5
Read: Pitman, sections 3.1-3.3

                         Discrete Random Variables
         * Introduction to Joint Distributions of Random Variables *

Roll a fair die until an ace (1) appears; how many non-aces do you see
first?  This is an example of a *RANDOM VARIABLE*, a number that depends
on chance.

a) What *is* a random variable?

   One answer:  A function from the sample space to the real numbers |R
   Another:     A number that depends on chance
   Secret:      Usually upper-case letters from the end of the alphabet
                are used... so if you see X, Y, or Z, it's probably an RV

   Let's call the number of non-aces X.

b) What questions can we ask & answer about random variables?

   One:         P[ X < 3 ] = 1 - (5/6)^3 = 1 - 125/216 = 91/216 = .4213

   Another:     P[ X = 2 ] = P[ X >= 2 ] - P[ X >= 3 ]
                           = (5/6)^2 - (5/6)^3
                           = 25/36 - 125/216 = 25/216 = .1157
                OR
                           = P[ ~A ~A A ] = (5/6)(5/6)(1/6) = 25/216 = .1157

   Yet Another: What would X be, on average, in lots of repeated trials?

   Variation:   Instead of P[Ace] = 1/6, count the # of failures before the
                1st success if successes have probability p, 0 < p <= 1.

Another example: choose three different numbers at random, without
replacement, from {1, 2, ..., 20}.  What is the chance that at least one
of the chosen numbers is 17 or more?

     P[ at least one >= 17 ] = 1 - P[ all three <= 16 ]
                             = 1 - (16/20)*(15/19)*(14/18) = 29/57 = .5088

X = max number selected; what are the possible values of X and their
probabilities?

     P[X=20] = 3/20                            =          .1500
     P[X=19] = 3 * (18/20) * (17/19) * (1/18)  = 51/380 = .1342
     P[X=18] = 3 * (17/20) * (16/19) * (1/18)  = 34/285 = .1193
     P[X=17] = 3 * (16/20) * (15/19) * (1/18)  = 2/19   = .1053
                                     P[X>=17]  =          .5088

     P[X=x] = 3 * (x-1)*(x-2)/(20*19*18) = (x-1)(x-2)/2280,   x = 3,4,...,20

     Another way:  P[X=x] = (x-1:2) / (20:3)   (also correct)

DEF: A (real-valued) RANDOM VARIABLE is a (real-valued) function on the
     sample space Omega.

Example: if Omega is the usual 36-point space for two rolls of a fair die,
say, { (r,g) : 1 <= r,g <= 6 }, all equally likely, then

     X(r,g) = r          Y(r,g) = |r-g|          Z(r,g) = r+g

are all random variables.  What is the probability that Y=1?  What is that
EVENT?

DEF: The *RANGE* of a random variable is just the set of its possible
     values.

     The *DISTRIBUTION* of a random variable is any specification of
     P[ X in A ] for every set A...  if X has only finitely-many (or
     countably-many) values, the DISTRIBUTION can be specified by giving
     the probability of each outcome in the range,

          f(x) = P[ X = x ]

     and then  P[ X in A ] = sum { f(x) : x in A }  is specified for
     every A.  For other random variables, like "uniform" and "normal"
     among others, we'll have to do something else--- we start that just
     after Fall Break.  It's always good enough to specify
     F(x) = P[ X <= x ] for every x; then we can work out the probability
     that X is in any interval, any union of intervals, etc; more later.

If X is any random variable and g is any function, then Y = g(X) is
another random variable:

              X              g
     Omega -----> |R ----------> |R

Actually, X could be a function from Omega to any set at all (say, "E")
and g could be a function from E to the real numbers, and we'd still be
okay.  If X is discrete with pmf f(x) = P[X=x], what is the DISTRIBUTION
of Y = g(X) ???

     P[ Y = y ] = P[ g(X) = y ] = SUM { P[ X=x ] : g(x)=y }
                = P[ X in g^{-1}(y) ]

-----------------------------------------------------------------------------
               Random Vectors and Joint Distributions

Draw two socks at random, without replacement, from a drawer full of
twelve colored socks:

     6 black     4 white     2 purple

Let B be the number of Black socks, W the number of White socks drawn.
The *DISTRIBUTIONS* of B and W are easy to write down; each has only 3
values in its range, with the probability tables below (why?).
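One way to double-check those tables is brute force: list all (12:2) = 66
equally likely pairs of socks and tally B and W for each pair.  Here is a
minimal Python sketch of that count (the color strings and variable names
are just for illustration); it also tallies the joint probabilities
P[B=b, W=w] that show up a little further below.

     from itertools import combinations
     from collections import Counter

     # 12 socks in the drawer: 6 black, 4 white, 2 purple
     socks = ["black"] * 6 + ["white"] * 4 + ["purple"] * 2

     joint = Counter()                            # counts of (b, w) over all draws
     pairs = list(combinations(range(12), 2))     # all (12:2) = 66 equally likely pairs
     for i, j in pairs:
         b = (socks[i] == "black") + (socks[j] == "black")   # black socks drawn
         w = (socks[i] == "white") + (socks[j] == "white")   # white socks drawn
         joint[(b, w)] += 1

     n = len(pairs)                               # 66
     for b in range(3):                           # marginal pmf of B
         print(f"P[B={b}] = {sum(c for (bb, _), c in joint.items() if bb == b)}/{n}")
     for w in range(3):                           # marginal pmf of W
         print(f"P[W={w}] = {sum(c for (_, ww), c in joint.items() if ww == w)}/{n}")
     for (b, w), c in sorted(joint.items()):      # joint pmf, used in the table below
         print(f"P[B={b}, W={w}] = {c}/{n}")

Running it gives the same 15/66, 36/66, ... values, already over the common
denominator 66.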
To make it easier to compare & add numbers, I'll put everything over the
same denominator instead of our usual convention of "lowest terms":

         b, w:        0        1        2
     ---------------------------------------------------------------------
     P[ B = b ]:    15/66    36/66    15/66   = (6:b)(6:2-b)/(12:2)   (**)
     P[ W = w ]:    28/66    32/66     6/66   = (4:w)(8:2-w)/(12:2)

This table doesn't let us know everything--- for example, what is the
probability that we draw a matching pair?  What's the probability that we
have one each of black and white socks?  We don't have enough to tell
(e.g., we can't tell the probability of a purple pair).

The *JOINT* distribution of B and W tells us the probability of every
possible PAIR (b,w) of numbers... we can present it in a formula

     P(b,w) = (6:b)(4:w)(2:2-b-w)/(12:2)

or in a table:

                              W
                     0        1        2
          +----------------------------------++---------
        0 |        1/66     8/66     6/66    ||  15/66
          |                                  ||
    B   1 |       12/66    24/66      0      ||  36/66
          |                                  ||
        2 |       15/66      0        0      ||  15/66
          +==================================++=========
                  28/66    32/66     6/66        66/66

Note that the MARGINAL SUMS are the same numbers we had before in (**);
they are called the *MARGINAL DISTRIBUTIONS* of B and W.  Now we can see
the probability of a matching pair:

     Black     White     Purple
     15/66  +   6/66  +   1/66   =  22/66  =  1/3,

or the probability of a black-and-white pair, 24/66 = 4/11.

-----------------------------------------------------------------------------
* EXPECTATIONS *

We can use the JOINT distribution P[ X=x, Y=y ] to find expectations of
functions of any two discrete random variables X and Y :

     E[ g(X,Y) ] = SUM { g(x,y) * P[ X=x, Y=y ] }

For example, above, the expectation of the PRODUCT g(B,W) = B * W of the
numbers of Black and White socks is

     E[ B * W ] = 0*0* 1/66 + 0*1* 8/66 + 0*2* 6/66
                + 1*0*12/66 + 1*1*24/66 + 1*2* 0
                + 2*0*15/66 + 2*1* 0    + 2*2* 0
                = 24/66 = 4/11.

Why was that obvious already????

Note this cannot be calculated from the *marginal* distributions of B and
W--- and (in particular) it is NOT THE SAME as

     E[ B ] * E[ W ] = { (36+30)/66 = 66/66 } * { (32+12)/66 = 44/66 }
                     = { 1 } * { 2/3 } = 2/3

DEFINITION: The *COVARIANCE* of two RVs is:

     Cov(X, Y) = E[ (X-mu_X) * (Y-mu_Y) ] = E[ X*Y ] - mu_X * mu_Y

so, here,

     Cov(B, W) = 4/11 - 2/3 = (12-22)/33 = -10/33 = -0.30303

Let Z = a*X + b*Y + c; what are the MEAN and VARIANCE of Z ?

     E[ Z ]   = a*mu_X + b*mu_Y + c

     VAR[ Z ] = E{ [ a*(X-mu_X) + b*(Y-mu_Y) ]^2 }
              = a^2 E[ (X-mu_X)^2 ] + 2*a*b E[ (X-mu_X)(Y-mu_Y) ]
                                    + b^2 E[ (Y-mu_Y)^2 ]
              = a^2 sigma^2_X + b^2 sigma^2_Y + 2 a b Cov(X,Y)

For example:

     VAR[ X + Y ] = sig_X^2 + sig_Y^2 + 2 Cov(X,Y)
     VAR[ X - Y ] = sig_X^2 + sig_Y^2 - 2 Cov(X,Y)

======================================================================
* Conditional Distributions *

The marginal probability of w white socks in the draw is:

                     28/66 = 14/33   for w=0;
     P[ W = w ]  =   32/66 = 16/33   for w=1;
                      6/66 =  3/33   for w=2.

But what if we KNOW that we drew ZERO BLACK socks?  *Then* the
*conditional distribution* of W would be:

                              1/15   for w=0;
     P[ W = w | B = 0 ]  =    8/15   for w=1;
                              6/15   for w=2.

(note it's much more likely now for W=2).
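As a numerical check on the covariance and the conditional distribution
above, here is a short Python sketch; it just types the joint table in by
hand as exact fractions (the dictionary layout is one convenient choice,
not anything from Pitman).

     from fractions import Fraction as F

     # joint pmf of (B, W) from the table above, everything over 66
     joint = {(0, 0): F(1, 66),  (0, 1): F(8, 66),  (0, 2): F(6, 66),
              (1, 0): F(12, 66), (1, 1): F(24, 66), (1, 2): F(0, 66),
              (2, 0): F(15, 66), (2, 1): F(0, 66),  (2, 2): F(0, 66)}

     EB  = sum(b * p for (b, w), p in joint.items())         # E[B]   = 1
     EW  = sum(w * p for (b, w), p in joint.items())         # E[W]   = 2/3
     EBW = sum(b * w * p for (b, w), p in joint.items())     # E[B*W] = 4/11
     cov = EBW - EB * EW                                     # Cov(B,W) = -10/33
     print(EB, EW, EBW, cov)

     # conditional distribution of W given B = 0
     PB0 = sum(p for (b, w), p in joint.items() if b == 0)   # P[B=0] = 15/66
     for w in range(3):
         print(f"P[W={w} | B=0] =", joint[(0, w)] / PB0)     # 1/15, 8/15, 6/15

The same few lines would work for any discrete joint pmf stored as a
dictionary of probabilities.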
More generally, for any two discrete random variables X and Y, the
*CONDITIONAL DISTRIBUTION* is

                            P[ X=x, Y=y ]       joint pmf
     P[ X = x | Y = y ]  =  --------------  =  --------------
                               P[ Y=y ]         marginal pmf

These can be used just like any other distribution to calculate, for
example, the *conditional* mean and variance (see below):

     E[ X | Y=y ] = SUM { x * P[ X=x | Y=y ] }

For example,

     E[ W   | B=0 ] = 0*(1/15) + 1*(8/15) + 2*(6/15) = 20/15 = 4/3

     E[ W^2 | B=0 ] = 0*(1/15) + 1*(8/15) + 4*(6/15) = 32/15

                                            96 - 80
     Var[ W | B=0 ] = 32/15 - (4/3)^2  =   ---------  =  16/45 = 0.3555556
                                               45

-----------------------------------------------------------------------------
* INDEPENDENCE *

Two random variables X and Y are *INDEPENDENT* if their joint pmf factors:

     P[ X=x , Y=y ] = P[ X=x ] * P[ Y=y ]

as the product of the marginal pmfs.  ( IF it factors at all as ANY
product f(x) * g(y), THEN it factors as the product of marginals.  Why? )

For independent X,Y, the covariance vanishes:

     Cov[ X,Y ] = E[ (X-mu_X) (Y-mu_Y) ]
                = SUM { (x-mu_X) * (y-mu_Y) * P[ X=x, Y=y ] }
                = SUM { (x-mu_X) * (y-mu_Y) * P[ X=x ] * P[ Y=y ] }
                = SUM { (x-mu_X) * P[ X=x ] } * SUM { (y-mu_Y) * P[ Y=y ] }
                = (mu_X - mu_X) * (mu_Y - mu_Y) = 0 * 0 = 0

and so the variance is simply

     Var[ a*X + b*Y ] = a^2 Var[X] + b^2 Var[Y]

For a=1 and b=1 or b=-1,

     Var[ X + Y ] = Var[X] + Var[Y]  ***AND***  Var[ X - Y ] = Var[X] + Var[Y]

-----------------------------------------------------------------------------
* Expectations *

If we draw some RV X repeatedly & independently, what will be its AVERAGE
VALUE?  For example, if we roll a fair die 600 times, what will the
average be?  If we denote the outcome on the i'th roll by X_i this looks
like:

             X_1 + X_2 + X_3 + ... + X_600
     Avg  =  -----------------------------
                         600

and it's a little hard to tell.  BUT--- if instead we think of how many
1's we will find, and how many 2's, and how many 3's and 4's and so forth,
we see the sum should be exactly

     X_1 + X_2 + X_3 + ... + X_600  =  1 * (# of 1's in 600 rolls)
                                     + 2 * (# of 2's in 600 rolls)
                                     + 3 * (# of 3's in 600 rolls)
                                     + 4 * (# of 4's in 600 rolls)
                                     + 5 * (# of 5's in 600 rolls)
                                     + 6 * (# of 6's in 600 rolls)

which should be about (why?)

     ~~  1 * 100 + 2 * 100 + 3 * 100 + 4 * 100 + 5 * 100 + 6 * 100  =  2100

so the average should be about

              2100
     Avg  ~  ------  =  3.5
               600

More generally, if we have any function g() and want to know the average
value of g(X) for a random variable X that takes each value x with
probability f(x), then in a large number N of tries the average will be
about

                      Sum [ g(x) * N * f(x) ]
     Avg[ g(X) ]  ~~  ------------------------  =  Sum g(x) * f(x)
                                 N

(note the N cancels top-and-bottom, so we can take the limit N->oo
easily).  This is a *weighted average of g(x)*, weighted by the
PROBABILITY that X=x; the fair-die example had

     g(x) = x          f(x) = 1/6   for   x = 1,2,3,4,5,6

----------------
DEFINITION:  The MEAN of X is            E[X]    = Sum { x * f(x) }
             (usually denoted "mu")

             The EXPECTATION of g(X) is  E[g(X)] = Sum { g(x) * f(x) }

Nobody will get upset if you mix up the words MEAN and EXPECTATION.

Note that mu has the same UNITS as X does--- if X is measured in feet,
meters, seconds, or fortnights then so is mu.

-----------------
On average, any RV X will be equal to its mean "mu"... but how far from
mu will X be?

* Can't measure this by the average of ( X - mu ) (that average is always
  zero); the points where X > mu are balanced out by the points where
  X < mu.

* Instead we measure spread by the average of ( X - mu )^2, which counts
  deviations both when X > mu AND when X < mu: the VARIANCE,
  Var[X] = sigma^2 = E[ (X - mu)^2 ].
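To see the long-run-average idea numerically, here is a small Python
sketch (the seed is arbitrary, just so the run is repeatable): it
simulates 600 rolls of a fair die, compares the sample average with the
weighted average Sum x*f(x) = 3.5, and computes the variance
E[ (X-mu)^2 ] = 35/12 by the same weighted-average formula.

     import random
     from fractions import Fraction as F

     random.seed(1)                                # arbitrary seed, for repeatability
     rolls = [random.randint(1, 6) for _ in range(600)]
     print(sum(rolls) / len(rolls))                # sample average; should be near 3.5

     f = {x: F(1, 6) for x in range(1, 7)}         # pmf of one roll of a fair die
     mu = sum(x * p for x, p in f.items())         # E[X] = Sum x*f(x)      = 7/2
     var = sum((x - mu)**2 * p for x, p in f.items())   # E[(X-mu)^2]       = 35/12
     print(mu, var)

Re-running it with a different seed changes the sample average a little,
but it stays close to 3.5; the exact values mu = 7/2 and Var[X] = 35/12
do not change.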