STA114/MTH136 Lecture Notes, Week 8: Hypothesis Testing

Introduction: Tiny Toy Example:

Suppose I toss a coin 10 times and claim I find 10 Heads.  Do you
believe that this is a fair coin?                 [ presumably, no ]
Even though the probability of ANY specific sequence is 1/1024?
                                                  [ still no ]
WHY?

Answer: Because such an extreme outcome is very unlikely for a fair
coin, but you can easily imagine me lying, or using a 2-headed coin,
or otherwise NOT having the number X of heads follow a Bi(10, 0.5)
distribution.  Thus, EITHER:

        It's a fair coin and I just witnessed a near-miracle,
    OR: It's not a fair coin.

The skeptical position is to deny (most) miracles, and conclude that
probably the coin is unfair--- even though it is quite possible for a
fair coin to fall heads 10 times in a row.  The probability of 10
heads or 10 tails is 2/1024 = 1/512 = 0.001953125 < 0.002.

============= Classical Hypothesis Testing =============

A "Hypothesis" is an assertion about the *population*, i.e. the
probability distribution of the data, *not* about the data themselves.
Classical hypothesis testing is concerned with discovering and
reporting whether or not the observed data X are consistent with a
possible hypothesis Ho.

Formally, what most investigators do is to pick a small number
alpha > 0 (.05 and .01 are popular) and report a "p-value":

    p = Prob[ Data as extreme or more so AGAINST Ho | Ho true ]

This measures "how big a miracle" you've seen; SMALL p is regarded as
STRONG evidence AGAINST Ho, so:

    REJECT Ho          if  p <= alpha
    FAIL TO REJECT Ho  if  p >  alpha

--------
Of course you've got to figure out which outcomes ARE "more extreme"
than the ones you saw... and there can be lots of different ways of
doing this.
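The coin arithmetic above can be checked in a few lines.  This is a
Python sketch (not part of the original notes); it uses exact
fractions so no rounding is involved:

```python
from fractions import Fraction

# Probability of any ONE specific sequence of 10 fair-coin tosses:
p_any_sequence = Fraction(1, 2) ** 10      # = 1/1024

# "As extreme or more so" against Ho: fair coin means the two
# outcomes {10 heads, 10 tails}:
p_extreme = 2 * p_any_sequence             # = 2/1024 = 1/512

print(p_any_sequence)    # 1/1024
print(float(p_extreme))  # 0.001953125
```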
In principle you're free to choose ANY statistic T(X) that you'd
expect to be small when Ho is true and big when Ho is false, then
report as p-value

    p = Prob[ T >= Observed t | Ho true ]

If Ho is true then this "p-value" will have a Uniform distribution on
(0,1) (at least when T has a continuous distribution); if Ho is false
then T should be bigger than "reasonable" and so p will be smaller
than typical U[0,1]'s.

The idea is that you Reject Ho with probability alpha when in fact Ho
is true (a mistake called a "type I error"), but hopefully you Reject
Ho with a much higher probability (the "power") when Ho is false.
The other kind of error you might make is a "type II error", Failing
to reject Ho when Ho is false; its probability is called

    beta = Prob[ Fail to reject Ho | Ho false ],

so the power is 1 - beta.

--------
Testing at level alpha, as described above, is always equivalent to
the following procedure: compute a "critical value" c by the formula

    alpha = Prob[ T >= c | Ho true ]

Reject Ho if you observe T >= c; do not reject if you find T < c.

--------
Example: the two-sided "Z test".  Let Xj ~ No(mu, sig^2) with sig
KNOWN, and test Ho: mu = mu0 using T = |xbar - mu0|:

    p = Pr[ |xbar - mu0| >= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ |xbar - mu0|/(sig/sqrt n) >= OBS |xbar - mu0|/(sig/sqrt n) ]
      = Pr[ |Z| >= Observed value z = |xbar - mu0|/(sig/sqrt n) ]
      = Phi(-z) + 1 - Phi(z)
      = 2*Phi(-z)

If "more extreme" values are big values of mu (instead of *all* values
of mu != mu0) then a better test is the right-tailed "one-sided Z test"

    p = Pr[ (xbar - mu0) >= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ (xbar - mu0)/(sig/sqrt n) >= OBS (xbar - mu0)/(sig/sqrt n) ]
      = Pr[ Z >= Observed value z = (xbar - mu0)/(sig/sqrt n) ]
      = 1 - Phi(z) = Phi(-z)

while the left-tailed one-sided test would have

    p = Pr[ (xbar - mu0) <= t | Xj ~ No(mu0, sig^2) ]
      = Pr[ (xbar - mu0)/(sig/sqrt n) <= OBS (xbar - mu0)/(sig/sqrt n) ]
      = Pr[ Z <= Observed value z = (xbar - mu0)/(sig/sqrt n) ]
      = Phi(z)

---------
If SIG is unknown, the same idea works with the Normal distribution
replaced with the t distribution--- the famous "t test".
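The three Z-test p-values only require the standard normal CDF Phi.
A Python sketch (the helper names Phi and z_test are mine, not from
the notes; Phi is built from math.erf):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_test(xbar, mu0, sig, n, tail="two"):
    """p-value for the Z test of Ho: mu = mu0, with sigma KNOWN."""
    z = (xbar - mu0) / (sig / sqrt(n))
    if tail == "two":
        return 2 * Phi(-abs(z))   # p = 2*Phi(-|z|)
    elif tail == "right":
        return Phi(-z)            # p = 1 - Phi(z)
    else:                         # "left"
        return Phi(z)

# e.g. n = 100 observations with sigma = 1 and xbar = 0.4:
print(z_test(0.4, 0.0, 1.0, 100, tail="right"))  # Phi(-4), about 3.167e-05
```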
With S^2 = (1/n) sum (Xj - xbar)^2 and S = sqrt(S^2), we set

              xbar - mu0
    t = -------------------
          S / sqrt(n-1)

and set

    p = Pr[ |T| >= |t| ] = 2 Pr[ T <= -|t| ] = 2*pt(-|t|, n-1)

for Student's t distribution with df = n-1 for the two-sided test, and

    p = Pr[ T >= t ] = 1 - Pr[ T < t ] = Pr[ T <= -t ] = pt(-t, n-1)
    p = Pr[ T <= t ]                                   = pt( t, n-1)

for the right-tailed and left-tailed one-sided tests, respectively
(pt(., df) is the Student-t CDF, as in S-Plus).

----------
For a Bayesian approach to hypothesis testing, we merely find a prior
distribution xi() such that 0 < xi(Ho) < 1 and find the posterior
probability

                     Int_Ho    [ xi(theta) L(theta|data) d theta ]
    P[ Ho | data ] = ---------------------------------------------
                     Int_Theta [ xi(theta) L(theta|data) d theta ]

where the numerator and denominator are integrals or sums (or a bit of
both) of the prior * likelihood: over just the hypothesis set in the
numerator ("upstairs"), and over all possible parameter values in the
denominator ("downstairs").

Example: If we use a mixed prior distribution

    xi(theta=.5) = 1/2                  <-- point mass of size 1/2 at 1/2
    xi(theta)    = 1/2,  0 < theta < 1  <-- uniform density on (0,1)

then the posterior is partly discrete and partly continuous.

----------
Example: Test Ho: mu = 0 against H1: mu = 1 on the basis of n = 100
observations Xj ~ No(mu, 1), and suppose we observe Xbar = 0.4.  The
classical p-value is

    p = Pr[ Xbar >= 0.4 | mu=0 ]
      = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/10) = 4 ]
      = Pr[ Z > 4 ] = Phi(-4) = 3.167124e-05,

so of course we'd reject Ho.  But here the only other choice is mu=1,
and that's even MORE outrageous for Xbar = 0.4 < 1/2!!!!  With the
Bayesian prior distribution xi(0) = xi(1) = 1/2,

                                 .5 exp(-(0.4-0)^2/0.02)
    Pr[ Ho | Xbar=0.4 ] = -------------------------------------------------
                          .5 exp(-(0.4-0)^2/0.02) + .5 exp(-(0.4-1)^2/0.02)

                                 exp(-8)
                        = ------------------- = 0.9999546
                          exp(-8) + exp(-18)

so there is a much higher chance of Ho: mu=0 than of mu=1, and the
p-value is quite misleading.
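The two-point-prior posterior above (and its variants with other
sample sizes below) is a one-liner once the common normalizing factors
are cancelled.  A Python sketch, with a function name of my own
choosing:

```python
from math import exp

def post_prob_H0(xbar, n, mu0=0.0, mu1=1.0, sig=1.0):
    """Posterior Pr[mu = mu0 | xbar] under the two-point prior
    xi(mu0) = xi(mu1) = 1/2, when Xbar ~ No(mu, sig^2/n)."""
    # Likelihoods of xbar under each hypothesis; the factor
    # sqrt(n)/(sig*sqrt(2*pi)) is common to both and cancels.
    L0 = exp(-n * (xbar - mu0) ** 2 / (2 * sig ** 2))
    L1 = exp(-n * (xbar - mu1) ** 2 / (2 * sig ** 2))
    return 0.5 * L0 / (0.5 * L0 + 0.5 * L1)

print(post_prob_H0(0.4, 100))  # exp(-8)/(exp(-8)+exp(-18)), about 0.99995
```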
With 25 observations the same analysis becomes:

    p = Pr[ Xbar >= 0.4 | mu=0 ]
      = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/5) = 2 ]
      = Pr[ Z > 2 ] = Phi(-2) = 0.02275013

                                 .5 exp(-(0.4-0)^2/0.08)
    Pr[ Ho | Xbar=0.4 ] = -------------------------------------------------
                          .5 exp(-(0.4-0)^2/0.08) + .5 exp(-(0.4-1)^2/0.08)

                                  exp(-2)
                        = -------------------- = 0.9241418,
                          exp(-2) + exp(-4.5)

while with 4 observations it becomes

    p = Pr[ (Xbar-mu)/(sig/sqrt n) >= (0.4 - 0)/(1/2) = 0.8 ]
      = Phi(-0.8) = 0.2118554

                                  .5 exp(-(0.4-0)^2/0.5)
    Pr[ Ho | Xbar=0.4 ] = -----------------------------------------------
                          .5 exp(-(0.4-0)^2/0.5) + .5 exp(-(0.4-1)^2/0.5)

                                  exp(-0.32)
                        = ----------------------- = 0.5986877
                          exp(-0.32) + exp(-0.72)

Note P-value < 0.5 < P[Ho|data], so different decisions would ensue.

=============
Example, due to Jim Berger and Don Berry: Observe X = 13 ~ Bi(17, p)
and wish to consider Ho: p = 1/2.

Classical:

    p-value = Pr[ |X - 17/2| >= |13 - 17/2| = 4.5 | p=0.5 ]
            = Pr[ X = 0,1,2,3,4,13,14,15,16,17 | p=0.5 ]
            = 6428/2^17 = 0.04904175

Thus, we would Reject Ho at level alpha = 0.05 (just barely).

Bayesian: Prior

    xi(1/2) = 1/2                <-- point mass of size 1/2 at 1/2
    xi(th)  = 1/2,  0 < th < 1   <-- uniform on an interval of
                                     length 1 centered at 1/2

Posterior:

                                   (1/2) * (1/2)^17
    Pr[ th=1/2 | X=13 ] = ------------------------------------
                          (1/2) * (1/2)^17 + (1/2) * INT

where INT = integral of [th^13 * (1-th)^4] from 0 to 1.  But we can
get that integral from the normalizing factor for the beta Be(14,5)
dist'n: INT = 13! * 4! / 18!.  So

                                   2^(-17)
    Pr[ th=1/2 | X=13 ] = ------------------------
                          2^(-17) + 13! * 4! / 18!

                        = 1/(1 + 2^17 * 13! * 4! / 18!) = 0.2463315,

about FIVE TIMES the p-value.  In fact, Jim and Don show that for ANY
symmetric unimodal prior distribution (like Un[.5-a, .5+a] for any
0 < a <= 1/2, etc.), the smallest value Pr[ th=1/2 | X=13 ] can
possibly have is about 0.08, almost double the p-value.

SO, p-values ARE NOT probabilities of Ho.  They aren't.  Really.
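Both numbers in the Berger-Berry example can be reproduced exactly
with integer arithmetic.  A Python sketch (variable names are mine):

```python
from math import comb, factorial

n, x = 17, 13

# Two-sided p-value: all outcomes at least as far from n/2 as x = 13,
# i.e. k in {0,...,4} or {13,...,17}.
extreme = [k for k in range(n + 1) if abs(k - n / 2) >= abs(x - n / 2)]
p_value = sum(comb(n, k) for k in extreme) / 2 ** n    # 6428/2^17

# Posterior Pr[th = 1/2 | X = 13] under the mixed prior:
# point mass 1/2 at th = 1/2, plus a Un(0,1) density with mass 1/2.
INT = factorial(13) * factorial(4) / factorial(18)     # Int th^13 (1-th)^4 dth
post = 2 ** -17 / (2 ** -17 + INT)

print(p_value)  # 0.049041748046875
print(post)     # about 0.2463315
```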
========= Composite Hypotheses, Alternate Hypotheses =========

Sometimes a hypothesis Ho is presented along with an alternative H1;
this is intended to help suggest a suitable statistic T that is
usually big when H1 is true and usually small when Ho is true, so that

    p = Pr[ T >= t | Ho ]

will be a suitable p-value for testing Ho vs H1.  When Ho and H1 are
both *POINT HYPOTHESES*, in the sense that they specify the
distributions f0(x) and f1(x) completely, the Neyman-Pearson Lemma
says that the BEST POSSIBLE T is (any monotone function of) the
Likelihood Ratio

    R(x) = f1(x) / f0(x)

Examples:

1.  Po(lam_0) vs Po(lam_1):

           (lam_1)^(Sum Xj) * exp(-n lam_1) / prod Xj!
    R(x) = -------------------------------------------
           (lam_0)^(Sum Xj) * exp(-n lam_0) / prod Xj!

         = (lam_1/lam_0)^(Sum Xj) * exp(n(lam_0 - lam_1))

         prop. to  exp( log(lam_1/lam_0) * Sum Xj )

    Suitable monotone functions of R include Sum Xj and X-bar.

2.  Bi(n,p) vs Bi(n,q):

    R(x) = [q/p]^(Sum Xj) * [(1-q)/(1-p)]^(n - Sum Xj)

         prop. to  [ q(1-p) / p(1-q) ]^(Sum Xj)

    so again Sum Xj or X-bar will do.

3.  Be(a0,b) vs Be(a1,b):

    R(x) = {ratio of some gamma functions} * (Prod Xj)^(a1 - a0)

    so Prod Xj (or Sum log Xj) would be equivalent to R(x).

What if Ho or H1 are *composite* hypotheses, that don't fully specify
the distribution?  Then:

1.  If Ho is composite, we must compute the p-value as

    p = Maximum  Pr[ T(X) >= T(x) | th ]
        th in Ho

2.  The usual (and good) choice for T is (any monotone function of)
    the generalized likelihood ratio

           Maximum { f(x | th1) : th1 in H1 }
    T(x) = ----------------------------------
           Maximum { f(x | th0) : th0 in Ho }

    Another good choice: if the likelihood ratio f(x|th1)/f(x|th0) is
    a monotone function of the same statistic T(x) for every th0 in Ho
    and th1 in H1, then use that T(x) for the composite test, too.

====================================================================
Bayesian analysis is easier: it doesn't matter whether Ho or H1 or
both are composite; just evaluate Pr[ Ho | X ] from the posterior
distribution xi(th | x).  This might be discrete, might be continuous,
and might be a little bit of both.
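The Poisson computation in Example 1 can be checked numerically: the
log likelihood ratio depends on the data only through Sum Xj.  A
Python sketch (function names and the toy data are mine):

```python
from math import log, factorial

def pois_loglik(xs, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations xs."""
    return sum(x * log(lam) - lam - log(factorial(x)) for x in xs)

def log_LR(xs, lam0, lam1):
    """log R(x) = log f1(x) - log f0(x) for Po(lam0) vs Po(lam1)."""
    return pois_loglik(xs, lam1) - pois_loglik(xs, lam0)

xs = [3, 5, 2, 4]            # toy data
lam0, lam1 = 2.0, 4.0
n, S = len(xs), sum(xs)

# Verify the algebra from the notes:
#   log R = log(lam1/lam0) * Sum Xj + n*(lam0 - lam1)
lhs = log_LR(xs, lam0, lam1)
rhs = log(lam1 / lam0) * S + n * (lam0 - lam1)
print(abs(lhs - rhs) < 1e-9)  # True: R is a monotone function of Sum Xj
```

Since lam1 > lam0 here, log(lam1/lam0) > 0 and R is increasing in
Sum Xj, so rejecting for large R is the same as rejecting for large
Sum Xj (or X-bar), as the notes state.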