STA114/MTH136 Lecture Notes, Week 1: Intro & Examples

1. Conditional Probability and Inference

All probabilities and probability distributions are "conditional" on the
background information and knowledge brought to bear in assessing the
probability or distribution.  First a simple example, to help illustrate the
special role conditional probability plays in INFERENCE, the topic of this
course.

EG1: Rare Disease

Suppose that one percent of the population have a genetic condition that
predisposes them to a disease, and that the genetic condition can be revealed
only by a clinical test.  Like all clinical tests, this one is imperfect;
still, it's a pretty good one.  For subjects WITH the condition, it detects
the condition with probability 98%; for subjects WITHOUT the condition, it
falsely gives a positive result with probability 2%.  For a subject drawn at
random from the population, let's find:

   a) The probability of the genetic condition;
   b) The probability of a positive test;
   c) IF the test is positive, the conditional probability of the GC;
   d) IF the test is negative, the conditional probability of the GC.

Let's introduce some notation; let THETA be the "disease state" (1 for
presence, 0 for absence of the genetic condition), and let T be the test
result (1 for "positive", 0 for "negative").  Now we can recognize our four
questions as:

   a) P[Th = 1] =
   b) P[T = 1] =
   c) P[Th = 1 | T = 1] =
   d) P[Th = 1 | T = 0] =

These are familiar computations from STA104/MTH135 - level probability:

   a) P[Th = 1] = 0.01                              (this is given in the problem)
   b) P[T = 1]  = (0.01)*(0.98) + (0.99)*(0.02) = 0.0296, about 3%
   c) P[Th = 1 | T = 1] = (0.01)*(0.98)/0.0296 = 0.0098/0.0296 = 0.3311
   d) P[Th = 1 | T = 0] = (0.01)*(0.02)/0.9704 = 0.0002/0.9704 = 0.0002

The (perhaps surprising) conclusion: a subject with a positive test is still
twice as likely to be free of the condition as to have it (i.e. "false
positives" are a serious problem); still, a subject with a negative test can
be reassured.

This is really a problem in INFERENCE: we have an unknown feature (Th) about
which we have some evidence (T), and conditional probability gives us the
link.

   BEFORE the evidence,   P[Th=1] = 0.01;
   AFTER  the evidence,   P[Th=1 | T=1] = 0.3311
                          P[Th=1 | T=0] = 0.0002
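For those who like to verify such arithmetic by machine, here is a minimal
Python sketch of the same Bayes-rule calculation (the variable names are
purely illustrative):

    # EG1: prior 1%, detection probability 98%, false-positive probability 2%
    prior    = 0.01     # P[Th = 1]
    p_pos_g1 = 0.98     # P[T = 1 | Th = 1]
    p_pos_g0 = 0.02     # P[T = 1 | Th = 0]

    p_pos    = prior * p_pos_g1 + (1 - prior) * p_pos_g0   # P[T = 1]
    post_pos = prior * p_pos_g1 / p_pos                    # P[Th = 1 | T = 1]
    post_neg = prior * (1 - p_pos_g1) / (1 - p_pos)        # P[Th = 1 | T = 0]

    print(round(p_pos, 4), round(post_pos, 4), round(post_neg, 4))
    # 0.0296 0.3311 0.0002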
EG2: New Disease

Now let's go another step: suppose a new condition becomes of clinical
interest, and we have little idea what the prevalence of the condition is;
let's denote the probability of the condition by P, 0 < P < 1, and (in the
spirit of Bayes and Laplace) treat the unknown P itself as a random quantity
with a uniform distribution on (0,1), reflecting our ignorance.  Suppose we
test N = 10 subjects drawn at random from the population and find that X = 3
of them have the condition.  The conditional (posterior) density of P, given
X = 3, is then proportional to p^3 (1-p)^7, i.e. P | X=3 ~ Beta(4,8).  From
this distribution we can compute the probabilities of events like [P > .90]
and [P < .01] [Homework: calculate them], and much more... e.g.

What is the most likely value of P?
   [ Ambiguous question..... Pr[P=x] = 0 for every x.  Still, the PDF for P
     has a maximum which we can find, by calculus or plotting, as
     P-Hat = 3/10 = 0.30 ]

What is the expected value of P???
   [ Not so ambiguous:  E[ P | X=3 ] = 4/12 = 1/3, from our familiarity with
     the Beta distribution ]

Is it possible that P=1/2?  That P>=1/2?
   [ Well, Pr[P=.5] = 0; but Pr[P>=.5] = 0.11328, so yes it IS possible ]

Is it possible that P>.90?
   [ Giving away a HW answer, P[ P > .90 | X=3 ] = 1.248e-6, so it's
     "possible", but not really believable... about one in a million chance ]
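These Beta(4,8) facts are easy to check numerically; here is a minimal
Python/scipy sketch (it simply evaluates the posterior distribution described
above):

    # Posterior of P given X = 3 (uniform prior, N = 10 trials): Beta(4, 8)
    from scipy.stats import beta

    post = beta(4, 8)
    print(post.mean())              # 0.3333... = 1/3, the posterior mean
    print((4 - 1) / (4 + 8 - 2))    # 0.30, the posterior mode P-Hat
    print(post.sf(0.5))             # Pr[ P >= .5 | X=3 ] = 0.11328...
    print(post.sf(0.9))             # Pr[ P >  .9 | X=3 ] = 1.248e-6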
*****************

Two other paradigms:

** In the 1930's and 40's, Sir Ronald Fisher proposed an entirely different
way of thinking about inference.  It's harder than the Laplace/Bayes
approach, and leads to less useful answers, but the calculations could be
done without computers, so it was the most successful way of doing statistics
for most of the 20th century and it is still important.  The idea is NOT to
use probability to think and talk about P, but to play hypothetical games, as
follows:

   IF P=.90, THEN X=3 has very small probability:           (10:3) .9^3  .1^7  = 8.75e-6
   IF P=.40, THEN X=3 has rather larger probability:        (10:3) .4^3  .6^7  = 0.215
   IF P=.30, THEN X=3 has the largest possible probability: (10:3) .3^3  .7^7  = 0.267
   IF P=.01, THEN X=3 has very small probability:           (10:3) .01^3 .99^7 = 0.000112

THEREFORE, IF P is as large as .9 or as small as .01, THEN we have just seen
a miracle; but IF P is around 0.30 or so, THEN we have seen just what we
should expect.  Since miracles are unusual, we conclude that P=.30 or so.

** Both the Bayes/Laplace approach and the Fisherian approach (also called
"frequentist" or, ironically, "classical") use probability in a fundamental
way to derive and defend their estimates of P; the third school, the
"LIKELIHOODISTS", is more abstract and treats the "likelihood function"

      L(p) = (N:X) p^X (1-p)^(N-X)

as quantifying the "evidence" about p as a possible value of P; the point
with greatest evidence is (again) P-Hat = 0.30, and the points with at least
50% of this much "evidence" would be those for which L(p) > (1/2) L(P-Hat).
This is by far the least common of the three paradigms at present, so we will
concentrate our limited time on the two dominant ones: Bayesian and
Frequentist.

2. Conditional Probability Distributions (DeG ch 2, 3.6-7)

Since so much of our attention will be focussed on conditional probabilities
and conditional probability distributions, it's a good idea to review them
and explore them a bit more deeply.  Please give a careful read to all of
DeGroot's Chapter 2 and to sections 6-10 of Chapter 3, from which most of the
HW exercises are drawn this week.

================================================================

First, some review of Conditional Probability Distributions and a fun
example, called "Borel's Paradox"; before we can do that, we should review
change of variables (COV) in one and in d dimensions.

Let X be a continuously-distributed random variable with pdf X ~ f(x) and let
U = g(X) for some nice function g: R -> R.  If g(x) is 1:1 and
differentiable, recall the COV "magic formula" for the density function f(u):

                  f(x)
      f(u) =  -----------        (*)
               | g'(x) |

where x is the (by assumption, unique) solution to "g(x)=u", and where
g'(x) = d g(x) / dx is the derivative and |g'(x)| its absolute value.  Here,
due to the difficulty of faking subscripts in plain text, we use the same
notation f( ) for the density functions of both U and X... but think of them
as separate functions, perhaps as "f_U (u)" and "f_X (x)".

If the function g(x) is differentiable but not 1:1 (think of sin(x)), we
simply add up similar terms over all the preimages:

                --      f(x)
      f(u) =    >    -----------        (**)
                --    | g'(x) |

where the sum extends over all {x: g(x)=u} (and so is zero if there are NO
such x's).

--------

The vector-valued version of COV is almost identical to this--- if X is a
d-dimensional random vector with pdf f(x) (now a function on R^d) and if g is
a nice 1:1 differentiable function from R^d -> R^d, then again equation (*)
holds, but now we must interpret the derivative carefully.  It's conventional
to use different notation too, beginning with the Jacobian MATRIX

             d g      | dg1/dx1  dg1/dx2  ...  dg1/dxd |
      J  =  -----  =  | dg2/dx1  dg2/dx2  ...  dg2/dxd |
             d x      |   ...      ...     :     ...   |
                      | dgd/dx1  dgd/dx2  ...  dgd/dxd |

whose (i,j)'th entry is the partial derivative of g_i(x) with respect to x_j.
Now the notation |J| represents the absolute value of the determinant of this
Jacobian matrix... in the 2x2 case we will see almost exclusively, this
reduces to

      |J| = | dg1/dx1 * dg2/dx2  -  dg1/dx2 * dg2/dx1 |

Sometimes it's easier to compute the derivatives dx_j/du_i of x_j wrt
u_i = g_i(x) than the (needed) ones.  Beware: the individual partial
derivatives are NOT simply reciprocals of one another (different variables
are held fixed in the two computations).  What IS true is the matrix
relation: if "j" denotes the OTHER Jacobian, the matrix of derivatives of x
with respect to u = g(x), then

      |J| = 1 / |j|

---------

Two Examples:

(1) Polar Coordinates:  Let (R,TH) = g(X,Y):

      R  = sqrt(X^2 + Y^2)        X = R cos(TH)       0 < R  < oo
      TH = arctan(Y/X)            Y = R sin(TH)       0 < TH < 2 pi

By the Chain Rule, the derivatives are

      dR/dX  = 2X * (1/2) * (X^2+Y^2)^(-1/2)  =  X/R
      dR/dY  = 2Y * (1/2) * (X^2+Y^2)^(-1/2)  =  Y/R
      dTH/dX = ( 1/(1+(Y/X)^2) ) * (-Y/X^2)   = -Y/R^2
      dTH/dY = ( 1/(1+(Y/X)^2) ) * ( 1/X )    =  X/R^2

            | X/R      Y/R   |
so    J  =  |                |    and   |J| = | X^2/R^3 + Y^2/R^3 | = 1/R
            | -Y/R^2   X/R^2 |

(the shortcut gives the same answer: the OTHER Jacobian, of (X,Y) with
respect to (R,TH), has |j| = | R cos^2(TH) + R sin^2(TH) | = R), and we find

      f(r,th) = r f(x,y)

For example, if X and Y are independent standard normally-distributed random
variables then

      f(r,th) = r * (1/(2 pi)) e^{-(x^2+y^2)/2}  =  (r/(2 pi)) e^{-r^2/2}

so R and TH are independent, with uniform distribution for TH, i.e. with
marginal density functions f(th) = 1/(2 pi), 0 < th < 2 pi, and
f(r) = r e^{-r^2/2}, r > 0.
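As a sanity check on this polar-coordinates result, here is a short Monte
Carlo sketch in Python (the sample size and the particular probabilities
checked are arbitrary choices):

    # Simulate independent standard normals and transform to polar coordinates
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    r  = np.sqrt(x**2 + y**2)
    th = np.arctan2(y, x) % (2 * np.pi)     # angle folded into (0, 2 pi)

    # TH should be roughly Uniform(0, 2 pi): mean about pi, P[TH < pi] about 1/2
    print(th.mean(), np.mean(th < np.pi))
    # R should have density r e^{-r^2/2}: e.g. P[R <= 1] = 1 - e^{-1/2} = 0.3935
    print(np.mean(r <= 1.0), 1 - np.exp(-0.5))
    # and R and TH should be (essentially) uncorrelated
    print(np.corrcoef(r, th)[0, 1])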