STA114/MTH136 Lecture Notes, Week 8: Hypothesis Testing

Introduction: Tiny Toy Example:

Suppose I toss a coin 10 times and claim I find 10 Heads.  Do you
believe that this is a fair coin?                 [ presumably, no ]
Even though the probability of ANY specific sequence is 1/1024?
                                                  [ still no ]
WHY?

Answer: Because such an extreme outcome is very unlikely for a fair
coin, but you can easily imagine me lying, or using a 2-headed coin,
or otherwise NOT having the number X of heads follow a Bi(10, 0.5)
distribution.  Thus, EITHER:

        It's a fair coin and I just witnessed a near-miracle,
    OR: It's not a fair coin.

The skeptical position is to deny (most) miracles, and conclude that
probably the coin is unfair--- even though it is quite possible for a
fair coin to fall heads 10 times in a row.  The probability of 10
heads or 10 tails is 2/1024 = 1/512 = 0.001953125 < 0.002.

============= Classical Hypothesis Testing =============

A "Hypothesis" is an assertion about the *population*, i.e. the
probability distribution of the data, *not* about the data themselves.
Classical hypothesis testing is concerned with discovering and
reporting whether or not the observed data X are consistent with a
possible hypothesis Ho.

Formally, what most investigators do is to pick a small number
alpha > 0 (.05 and .01 are popular) and report a "p-value":

    p = Prob[ Data as extreme or more so AGAINST Ho | Ho true ]

This measures "how big a miracle" you've seen; SMALL p is regarded as
STRONG evidence AGAINST Ho, so:

    REJECT Ho          if  p <= alpha
    FAIL TO REJECT Ho  if  p >  alpha

--------
Of course you've got to figure out which outcomes ARE "more extreme"
than the ones you saw... and there can be lots of different ways of
doing this.
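The coin arithmetic above can be checked in a few lines.  This is a
Python sketch (not part of the original notes); it uses exact
fractions so no rounding is involved:

```python
from fractions import Fraction

# Probability of any ONE specific sequence of 10 fair-coin tosses:
p_any_sequence = Fraction(1, 2) ** 10      # = 1/1024

# "As extreme or more so" against Ho: fair coin means the two
# outcomes {10 heads, 10 tails}:
p_extreme = 2 * p_any_sequence             # = 2/1024 = 1/512

print(p_any_sequence)    # 1/1024
print(float(p_extreme))  # 0.001953125
```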
In principle you're free to choose ANY statistic T(X) that you'd
expect to be small when Ho is true and big when Ho is false, then
report as p-value

    p = Prob[ T >= Observed t | Ho true ]

If Ho is true then this "p-value" will have a Uniform distribution on
(0,1) (at least when T has a continuous distribution); if Ho is false
then T should be bigger than "reasonable" and so p will be smaller
than typical U[0,1]'s.

The idea is that you Reject Ho with probability alpha when in fact Ho
is true (a mistake called a "type I error"), but hopefully you Reject
Ho with a much higher probability (the "power") when Ho is false.
The other kind of error you might make is a "type II error", Failing
to reject Ho when Ho is false; its probability is called

    beta = Prob[ Fail to reject Ho | Ho false ],

so the power is 1 - beta.

--------
Testing at level alpha, as described above, is always equivalent to
the following procedure: compute a "critical value" c by the formula

    alpha = Prob[ T >= c | Ho true ]

Reject Ho if you observe T >= c; do not reject if you find T < c.

--------
Example: the two-sided "Z test".  Let Xj ~ No(mu, sig^2) with sig
KNOWN, and test Ho: mu = mu0 using T = |xbar - mu0|:

    p = Pr[ |xbar - mu0| >= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ |xbar - mu0|/(sig/sqrt n) >= OBS |xbar - mu0|/(sig/sqrt n) ]
      = Pr[ |Z| >= Observed value z = |xbar - mu0|/(sig/sqrt n) ]
      = Phi(-z) + 1 - Phi(z)
      = 2*Phi(-z)

If "more extreme" values are big values of mu (instead of *all* values
of mu != mu0) then a better test is the right-tailed "one-sided Z test"

    p = Pr[ (xbar - mu0) >= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ (xbar - mu0)/(sig/sqrt n) >= OBS (xbar - mu0)/(sig/sqrt n) ]
      = Pr[ Z >= Observed value z = (xbar - mu0)/(sig/sqrt n) ]
      = 1 - Phi(z) = Phi(-z)

while the left-tailed one-sided test would have

    p = Pr[ (xbar - mu0) <= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ (xbar - mu0)/(sig/sqrt n) <= OBS (xbar - mu0)/(sig/sqrt n) ]
      = Pr[ Z <= Observed value z = (xbar - mu0)/(sig/sqrt n) ]
      = Phi(z)

---------
If SIG is unknown, the same idea works with the Normal distribution
replaced with the t distribution--- the famous "t test".
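The three Z-test p-values only require the standard normal CDF Phi.
A Python sketch (the helper names Phi and z_test are mine, not from
the notes; Phi is built from math.erf):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_test(xbar, mu0, sig, n, tail="two"):
    """p-value for the Z test of Ho: mu = mu0, with sigma KNOWN."""
    z = (xbar - mu0) / (sig / sqrt(n))
    if tail == "two":
        return 2 * Phi(-abs(z))   # p = 2*Phi(-|z|)
    elif tail == "right":
        return Phi(-z)            # p = 1 - Phi(z)
    else:                         # "left"
        return Phi(z)

# e.g. n = 100 observations with sigma = 1 and xbar = 0.4:
print(z_test(0.4, 0.0, 1.0, 100, tail="right"))  # Phi(-4), about 3.167e-05
```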
With S^2 = (1/n) sum (Xj - xbar)^2 and S = sqrt(S^2), we set

              xbar - mu0
    t = -------------------
          S / sqrt(n-1)

and set

    p = Pr[ |T| >= |t| ] = 2 Pr[ T <= -|t| ] = 2*pt(-|t|, n-1)

for Student's t distribution with df = n-1 for the two-sided test, and

    p = Pr[ T >= t ] = 1 - Pr[ T < t ] = Pr[ T <= -t ] = pt(-t, n-1)
    p = Pr[ T <= t ]                                   = pt( t, n-1)

for the right-tailed and left-tailed one-sided tests, respectively
(pt(., df) is the Student-t CDF, as in S-Plus).

----------
For a Bayesian approach to hypothesis testing, we merely find a prior
distribution xi() such that 0 < xi(Ho) < 1 and find the posterior
probability

                     Int_Ho    [ xi(theta) L(theta|data) d theta ]
    P[ Ho | data ] = ---------------------------------------------
                     Int_Theta [ xi(theta) L(theta|data) d theta ]

where the numerator and denominator are integrals or sums (or a bit of
both) of the prior * likelihood: over just the hypothesis set in the
numerator ("upstairs"), and over all possible parameter values in the
denominator ("downstairs").

Example: If we use a mixed prior distribution

    xi(theta=.5) = 1/2                  <-- point mass of size 1/2 at 1/2
    xi(theta)    = 1/2,  0 < theta < 1  <-- uniform density on (0,1)

then the posterior is partly discrete and partly continuous.

----------
Example: Test Ho: mu = 0 against H1: mu = 1 on the basis of n = 100
observations Xj ~ No(mu, 1), and suppose we observe Xbar = 0.4.  The
classical p-value is

    p = Pr[ Xbar >= 0.4 | mu=0 ]
      = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/10) = 4 ]
      = Pr[ Z > 4 ] = Phi(-4) = 3.167124e-05,

so of course we'd reject Ho.  But here the only other choice is mu=1,
and that's even MORE outrageous for Xbar = 0.4 < 1/2!!!!  With the
Bayesian prior distribution xi(0) = xi(1) = 1/2,

                                 .5 exp(-(0.4-0)^2/0.02)
    Pr[ Ho | Xbar=0.4 ] = -------------------------------------------------
                          .5 exp(-(0.4-0)^2/0.02) + .5 exp(-(0.4-1)^2/0.02)

                                 exp(-8)
                        = ------------------- = 0.9999546
                          exp(-8) + exp(-18)

so there is a much higher chance of Ho: mu=0 than of mu=1, and the
p-value is quite misleading.
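The two-point-prior posterior above (and its variants with other
sample sizes below) is a one-liner once the common normalizing factors
are cancelled.  A Python sketch, with a function name of my own
choosing:

```python
from math import exp

def post_prob_H0(xbar, n, mu0=0.0, mu1=1.0, sig=1.0):
    """Posterior Pr[mu = mu0 | xbar] under the two-point prior
    xi(mu0) = xi(mu1) = 1/2, when Xbar ~ No(mu, sig^2/n)."""
    # Likelihoods of xbar under each hypothesis; the factor
    # sqrt(n)/(sig*sqrt(2*pi)) is common to both and cancels.
    L0 = exp(-n * (xbar - mu0) ** 2 / (2 * sig ** 2))
    L1 = exp(-n * (xbar - mu1) ** 2 / (2 * sig ** 2))
    return 0.5 * L0 / (0.5 * L0 + 0.5 * L1)

print(post_prob_H0(0.4, 100))  # exp(-8)/(exp(-8)+exp(-18)), about 0.99995
```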
With 25 observations the same analysis becomes:

    p = Pr[ Xbar >= 0.4 | mu=0 ]
      = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/5) = 2 ]
      = Pr[ Z > 2 ] = Phi(-2) = 0.02275013

                                 .5 exp(-(0.4-0)^2/0.08)
    Pr[ Ho | Xbar=0.4 ] = -------------------------------------------------
                          .5 exp(-(0.4-0)^2/0.08) + .5 exp(-(0.4-1)^2/0.08)

                                  exp(-2)
                        = -------------------- = 0.9241418,
                          exp(-2) + exp(-4.5)

while with 4 observations it becomes

    p = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/2) = 0.8 ]
      = Phi(-0.8) = 0.2118554

                                  .5 exp(-(0.4-0)^2/0.5)
    Pr[ Ho | Xbar=0.4 ] = -----------------------------------------------
                          .5 exp(-(0.4-0)^2/0.5) + .5 exp(-(0.4-1)^2/0.5)

                                  exp(-0.32)
                        = ----------------------- = 0.5986877
                          exp(-0.32) + exp(-0.72)

Note P-value < 0.5 < P[Ho|data], so different decisions would ensue.

=============
Example, due to Jim Berger and Don Berry: Observe X = 13 ~ Bi(17, p)
and wish to consider Ho: p = 1/2.

Classical:

    p-value = Pr[ |X - 17/2| >= |13 - 17/2| = 4.5 | p=0.5 ]
            = Pr[ X = 0,1,2,3,4,13,14,15,16,17 | p=0.5 ]
            = 6428/2^17 = 0.04904175

Thus, we would Reject Ho at level alpha = 0.05 (just barely).

Bayesian: Prior

    xi(1/2) = 1/2                <-- point mass of size 1/2 at 1/2
    xi(th)  = 1/2,  0 < th < 1   <-- uniform on an interval of
                                     length 1 centered at 1/2

Posterior:

                                   (1/2) * (1/2)^17
    Pr[ th=1/2 | X=13 ] = ------------------------------------
                          (1/2) * (1/2)^17 + (1/2) * INT

where INT = integral of [th^13 * (1-th)^4] from 0 to 1.  But we can
get that integral from the normalizing factor for the beta Be(14,5)
dist'n: INT = 13! * 4! / 18!.  So

                                   2^(-17)
    Pr[ th=1/2 | X=13 ] = ------------------------
                          2^(-17) + 13! * 4! / 18!

                        = 1/(1 + 2^17 * 13! * 4! / 18!) = 0.2463315,

about FIVE TIMES the p-value.  In fact, Jim and Don show that for ANY
symmetric unimodal prior distribution (like Un[.5-a, .5+a] for any
0 < a <= 1/2, etc.), the smallest value Pr[ th=1/2 | X=13 ] can
possibly have is about 0.08, almost double the p-value.

SO, p-values ARE NOT probabilities of Ho.  They aren't.  Really.
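Both numbers in the Berger-Berry example can be reproduced exactly
with integer arithmetic.  A Python sketch (variable names are mine):

```python
from math import comb, factorial

n, x = 17, 13

# Two-sided p-value: all outcomes at least as far from n/2 as x = 13,
# i.e. k in {0,...,4} or {13,...,17}.
extreme = [k for k in range(n + 1) if abs(k - n / 2) >= abs(x - n / 2)]
p_value = sum(comb(n, k) for k in extreme) / 2 ** n    # 6428/2^17

# Posterior Pr[th = 1/2 | X = 13] under the mixed prior:
# point mass 1/2 at th = 1/2, plus a Un(0,1) density with mass 1/2.
INT = factorial(13) * factorial(4) / factorial(18)     # Int th^13 (1-th)^4 dth
post = 2 ** -17 / (2 ** -17 + INT)

print(p_value)  # 0.049041748046875
print(post)     # about 0.2463315
```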
========= Composite Hypotheses, Alternate Hypotheses =========

Sometimes a hypothesis Ho is presented along with an alternative H1;
this is intended to help suggest a suitable statistic T that is
usually big when H1 is true and usually small when Ho is true, so that

    p = Pr[ T >= t | Ho ]

will be a suitable p-value for testing Ho vs H1.  When Ho and H1 are
both *POINT HYPOTHESES*, in the sense that they specify the
distributions f0(x) and f1(x) completely, the Neyman-Pearson Lemma
says that the BEST POSSIBLE T is (any monotone function of) the
Likelihood Ratio

    R(x) = f1(x) / f0(x)

Examples:

1.  Po(lam_0) vs Po(lam_1):

           (lam_1)^(Sum Xj) * exp(-n lam_1) / prod Xj!
    R(x) = -------------------------------------------
           (lam_0)^(Sum Xj) * exp(-n lam_0) / prod Xj!

         = (lam_1/lam_0)^(Sum Xj) * exp(n(lam_0 - lam_1))

         prop. to  exp( log(lam_1/lam_0) * Sum Xj )

    Suitable monotone functions of R include Sum Xj and X-bar.

2.  Bi(n,p) vs Bi(n,q):

    R(x) = [q/p]^(Sum Xj) * [(1-q)/(1-p)]^(n - Sum Xj)

         prop. to  [ q(1-p) / p(1-q) ]^(Sum Xj)

    so again Sum Xj or X-bar will do.

3.  Be(a0,b) vs Be(a1,b):

    R(x) = {ratio of some gamma functions} * (Prod Xj)^(a1 - a0)

    so Prod Xj (or Sum log Xj) would be equivalent to R(x).

What if Ho or H1 are *composite* hypotheses, that don't fully specify
the distribution?  Then:

1.  If Ho is composite, we must compute the p-value as

    p = Maximum  Pr[ T(X) >= T(x) | th ]
        th in Ho

2.  The usual (and good) choice for T is (any monotone function of)
    the generalized likelihood ratio

           Maximum { f(x | th1) : th1 in H1 }
    T(x) = ----------------------------------
           Maximum { f(x | th0) : th0 in Ho }

    Another good choice: if the likelihood ratio f(x|th1)/f(x|th0) is
    a monotone function of the same statistic T(x) for every th0 in Ho
    and th1 in H1, then use that T(x) for the composite test, too.

====================================================================
Bayesian analysis is easier: it doesn't matter whether Ho or H1 or
both are composite; just evaluate Pr[ Ho | X ] from the posterior
distribution xi(th | x).  This might be discrete, might be continuous,
and might be a little bit of both.
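The Poisson computation in Example 1 can be checked numerically: the
log likelihood ratio depends on the data only through Sum Xj.  A
Python sketch (function names and the toy data are mine):

```python
from math import log, factorial

def pois_loglik(xs, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations xs."""
    return sum(x * log(lam) - lam - log(factorial(x)) for x in xs)

def log_LR(xs, lam0, lam1):
    """log R(x) = log f1(x) - log f0(x) for Po(lam0) vs Po(lam1)."""
    return pois_loglik(xs, lam1) - pois_loglik(xs, lam0)

xs = [3, 5, 2, 4]            # toy data
lam0, lam1 = 2.0, 4.0
n, S = len(xs), sum(xs)

# Verify the algebra from the notes:
#   log R = log(lam1/lam0) * Sum Xj + n*(lam0 - lam1)
lhs = log_LR(xs, lam0, lam1)
rhs = log(lam1 / lam0) * S + n * (lam0 - lam1)
print(abs(lhs - rhs) < 1e-9)  # True: R is a monotone function of Sum Xj
```

Since lam1 > lam0 here, log(lam1/lam0) > 0 and R is increasing in
Sum Xj, so rejecting for large R is the same as rejecting for large
Sum Xj (or X-bar), as the notes state.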