STA114/MTH136 Lecture Notes, Week 1: Intro & Examples

1. Conditional Probability and Inference

All probabilities and probability distributions are "conditional" on the
background information and knowledge brought to bear in assessing the
probability or distribution.  First a simple example, to help illustrate the
special role conditional probability plays in INFERENCE, the topic of this
course.

EG1: Rare Disease

Suppose that one percent of the population have a genetic condition that
predisposes them to a disease, and that the genetic condition can be revealed
only by a clinical test.  Like all clinical tests, this one is imperfect;
still, it's a pretty good one.  For subjects WITH the condition, it detects
the condition with probability 98%; for subjects WITHOUT the condition, it
falsely gives a positive result with probability 2%.  For a subject drawn at
random from the population, let's find:

   a) The probability of the genetic condition;
   b) The probability of a positive test;
   c) IF the test is positive, the conditional probability of the GC;
   d) IF the test is negative, the conditional probability of the GC.

Let's introduce some notation; let THETA be the "disease state" (1 for
presence, 0 for absence of the genetic condition), and let T be the test
result (1 for "positive", 0 for "negative").  Now we can recognize our four
questions as:

   a) P[Th = 1] =
   b) P[T = 1] =
   c) P[Th = 1 | T = 1] =
   d) P[Th = 1 | T = 0] =

These are familiar computations from STA104/MTH135 - level probability:

   a) P[Th = 1] = 0.01                              (this is given in the problem)
   b) P[T = 1]  = (0.01)*(0.98) + (0.99)*(0.02) = 0.0296, about 3%
   c) P[Th = 1 | T = 1] = (0.01)*(0.98)/0.0296 = 0.0098/0.0296 = 0.3311
   d) P[Th = 1 | T = 0] = (0.01)*(0.02)/0.9704 = 0.0002/0.9704 = 0.0002

The (perhaps surprising) conclusion: a subject with a positive test is still
twice as likely to be free of the condition as to have it (i.e. "false
positives" are a serious problem); still, a subject with a negative test can
be reassured.

This is really a problem in INFERENCE: we have an unknown feature (Th) about
which we have some evidence (T), and conditional probability gives us the
link.

   BEFORE the evidence,   P[Th=1] = 0.01;
   AFTER  the evidence,   P[Th=1 | T=1] = 0.3311
                          P[Th=1 | T=0] = 0.0002
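For those who like to verify such arithmetic by machine, here is a minimal
Python sketch of the same Bayes-rule calculation (the variable names are
purely illustrative):

    # EG1: prior 1%, detection probability 98%, false-positive probability 2%
    prior    = 0.01     # P[Th = 1]
    p_pos_g1 = 0.98     # P[T = 1 | Th = 1]
    p_pos_g0 = 0.02     # P[T = 1 | Th = 0]

    p_pos    = prior * p_pos_g1 + (1 - prior) * p_pos_g0   # P[T = 1]
    post_pos = prior * p_pos_g1 / p_pos                    # P[Th = 1 | T = 1]
    post_neg = prior * (1 - p_pos_g1) / (1 - p_pos)        # P[Th = 1 | T = 0]

    print(round(p_pos, 4), round(post_pos, 4), round(post_neg, 4))
    # 0.0296 0.3311 0.0002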
EG2: New Disease

Now let's go another step: suppose a new condition becomes of clinical
interest, and we have little idea what the prevalence of the condition is;
let's denote the probability of the condition by P, 0 < P < 1, and (in the
spirit of Bayes and Laplace) treat the unknown P itself as a random quantity
with a uniform distribution on (0,1), reflecting our ignorance.  Suppose we
test N = 10 subjects drawn at random from the population and find that X = 3
of them have the condition.  The conditional (posterior) density of P, given
X = 3, is then proportional to p^3 (1-p)^7, i.e. P | X=3 ~ Beta(4,8).  From
this distribution we can compute the probabilities of events like [P > .90]
and [P < .01] [Homework: calculate them], and much more... e.g.

What is the most likely value of P?
   [ Ambiguous question..... Pr[P=x] = 0 for every x.  Still, the PDF for P
     has a maximum which we can find, by calculus or plotting, as
     P-Hat = 3/10 = 0.30 ]

What is the expected value of P???
   [ Not so ambiguous:  E[ P | X=3 ] = 4/12 = 1/3, from our familiarity with
     the Beta distribution ]

Is it possible that P=1/2?  That P>=1/2?
   [ Well, Pr[P=.5] = 0; but Pr[P>=.5] = 0.11328, so yes it IS possible ]

Is it possible that P>.90?
   [ Giving away a HW answer, P[ P > .90 | X=3 ] = 1.248e-6, so it's
     "possible", but not really believable... about one in a million chance ]
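These Beta(4,8) facts are easy to check numerically; here is a minimal
Python/scipy sketch (it simply evaluates the posterior distribution described
above):

    # Posterior of P given X = 3 (uniform prior, N = 10 trials): Beta(4, 8)
    from scipy.stats import beta

    post = beta(4, 8)
    print(post.mean())              # 0.3333... = 1/3, the posterior mean
    print((4 - 1) / (4 + 8 - 2))    # 0.30, the posterior mode P-Hat
    print(post.sf(0.5))             # Pr[ P >= .5 | X=3 ] = 0.11328...
    print(post.sf(0.9))             # Pr[ P >  .9 | X=3 ] = 1.248e-6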
*****************

Two other paradigms:

** In the 1930's and 40's, Sir Ronald Fisher proposed an entirely different
way of thinking about inference.  It's harder than the Laplace/Bayes
approach, and leads to less useful answers, but the calculations could be
done without computers, so it was the most successful way of doing statistics
for most of the 20th century and it is still important.  The idea is NOT to
use probability to think and talk about P, but to play hypothetical games, as
follows:

   IF P=.90, THEN X=3 has very small probability:           (10:3) .9^3  .1^7  = 8.75e-6
   IF P=.40, THEN X=3 has rather larger probability:        (10:3) .4^3  .6^7  = 0.215
   IF P=.30, THEN X=3 has the largest possible probability: (10:3) .3^3  .7^7  = 0.267
   IF P=.01, THEN X=3 has very small probability:           (10:3) .01^3 .99^7 = 0.000112

THEREFORE, IF P is as large as .9 or as small as .01, THEN we have just seen
a miracle; but IF P is around 0.30 or so, THEN we have seen just what we
should expect.  Since miracles are unusual, we conclude that P=.30 or so.

** Both the Bayes/Laplace approach and the Fisherian approach (also called
"frequentist" or, ironically, "classical") use probability in a fundamental
way to derive and defend their estimates of P; the third school, the
"LIKELIHOODISTS", is more abstract and treats the "likelihood function"

      L(p) = (N:X) p^X (1-p)^(N-X)

as quantifying the "evidence" about p as a possible value of P; the point
with greatest evidence is (again) P-Hat = 0.30, and the points with at least
50% of this much "evidence" would be those for which L(p) > (1/2) L(P-Hat).
This is by far the least common of the three paradigms at present, so we will
concentrate our limited time on the two dominant ones: Bayesian and
Frequentist.

2. Conditional Probability Distributions (DeG ch 2, 3.6-7)

Since so much of our attention will be focussed on conditional probabilities
and conditional probability distributions, it's a good idea to review them
and explore them a bit more deeply.  Please give a careful read to all of
DeGroot's Chapter 2 and to sections 6-10 of Chapter 3, from which most of the
HW exercises are drawn this week.

================================================================

First, some review of Conditional Probability Distributions and a fun
example, called "Borel's Paradox"; before we can do that, we should review
change of variables (COV) in one and in d dimensions.

Let X be a continuously-distributed random variable with pdf X ~ f(x) and let
U = g(X) for some nice function g: R -> R.  If g(x) is 1:1 and
differentiable, recall the COV "magic formula" for the density function f(u):

                  f(x)
      f(u) =  -----------        (*)
               | g'(x) |

where x is the (by assumption, unique) solution to "g(x)=u", and where
g'(x) = d g(x) / dx is the derivative and |g'(x)| its absolute value.  Here,
due to the difficulty of faking subscripts in plain text, we use the same
notation f( ) for the density functions of both U and X... but think of them
as separate functions, perhaps as "f_U (u)" and "f_X (x)".

If the function g(x) is differentiable but not 1:1 (think of sin(x)), we
simply add up similar terms over all the preimages:

                --      f(x)
      f(u) =    >    -----------        (**)
                --    | g'(x) |

where the sum extends over all {x: g(x)=u} (and so is zero if there are NO
such x's).

--------

The vector-valued version of COV is almost identical to this--- if X is a
d-dimensional random vector with pdf f(x) (now a function on R^d) and if g is
a nice 1:1 differentiable function from R^d -> R^d, then again equation (*)
holds, but now we must interpret the derivative carefully.  It's conventional
to use different notation too, beginning with the Jacobian MATRIX

             d g      | dg1/dx1  dg1/dx2  ...  dg1/dxd |
      J  =  -----  =  | dg2/dx1  dg2/dx2  ...  dg2/dxd |
             d x      |   ...      ...     :     ...   |
                      | dgd/dx1  dgd/dx2  ...  dgd/dxd |

whose (i,j)'th entry is the partial derivative of g_i(x) with respect to x_j.
Now the notation |J| represents the absolute value of the determinant of this
Jacobian matrix... in the 2x2 case we will see almost exclusively, this
reduces to

      |J| = | dg1/dx1 * dg2/dx2  -  dg1/dx2 * dg2/dx1 |

Sometimes it's easier to compute the derivatives dx_j/du_i of x_j wrt
u_i = g_i(x) than the (needed) ones.  Beware: the individual partial
derivatives are NOT simply reciprocals of one another (different variables
are held fixed in the two computations).  What IS true is the matrix
relation: if "j" denotes the OTHER Jacobian, the matrix of derivatives of x
with respect to u = g(x), then

      |J| = 1 / |j|

---------

Two Examples:

(1) Polar Coordinates:  Let (R,TH) = g(X,Y):

      R  = sqrt(X^2 + Y^2)        X = R cos(TH)       0 < R  < oo
      TH = arctan(Y/X)            Y = R sin(TH)       0 < TH < 2 pi

By the Chain Rule, the derivatives are

      dR/dX  = 2X * (1/2) * (X^2+Y^2)^(-1/2)  =  X/R
      dR/dY  = 2Y * (1/2) * (X^2+Y^2)^(-1/2)  =  Y/R
      dTH/dX = ( 1/(1+(Y/X)^2) ) * (-Y/X^2)   = -Y/R^2
      dTH/dY = ( 1/(1+(Y/X)^2) ) * ( 1/X )    =  X/R^2

            | X/R      Y/R   |
so    J  =  |                |    and   |J| = | X^2/R^3 + Y^2/R^3 | = 1/R
            | -Y/R^2   X/R^2 |

(the shortcut gives the same answer: the OTHER Jacobian, of (X,Y) with
respect to (R,TH), has |j| = | R cos^2(TH) + R sin^2(TH) | = R), and we find

      f(r,th) = r f(x,y)

For example, if X and Y are independent standard normally-distributed random
variables then

      f(r,th) = r * (1/(2 pi)) e^{-(x^2+y^2)/2}  =  (r/(2 pi)) e^{-r^2/2}

so R and TH are independent, with uniform distribution for TH, i.e. with
marginal density functions f(th) = 1/(2 pi), 0 < th < 2 pi, and
f(r) = r e^{-r^2/2}, r > 0.
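As a sanity check on this polar-coordinates result, here is a short Monte
Carlo sketch in Python (the sample size and the particular probabilities
checked are arbitrary choices):

    # Simulate independent standard normals and transform to polar coordinates
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    r  = np.sqrt(x**2 + y**2)
    th = np.arctan2(y, x) % (2 * np.pi)     # angle folded into (0, 2 pi)

    # TH should be roughly Uniform(0, 2 pi): mean about pi, P[TH < pi] about 1/2
    print(th.mean(), np.mean(th < np.pi))
    # R should have density r e^{-r^2/2}: e.g. P[R <= 1] = 1 - e^{-1/2} = 0.3935
    print(np.mean(r <= 1.0), 1 - np.exp(-0.5))
    # and R and TH should be (essentially) uncorrelated
    print(np.corrcoef(r, th)[0, 1])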