Where do priors come from?

In a statistical problem where we observe some quantity X whose distribution
depends on an uncertain parameter theta, we regard the likelihood function as a
conditional density function f(x | theta) for X *given theta*; to construct the
joint probability distribution for both X and theta, we need a marginal
distribution for theta, i.e., a "prior" pi(theta), to complete the
specification pi(x, theta) = f(x | theta) * pi(theta).

-----------------------------------------------------------------

Often the prior distribution pi(theta) is chosen to reflect earlier evidence
about theta, usually rather vague evidence--- for example, past experience with
experimental drugs treating cold symptoms may suggest that about 20% of them
show enough improvement over standard treatments to be worth pursuing---
suggesting perhaps a Be(1,4) or Be(2,8) distribution for the uncertain
probability p that a new drug shows such improvement.  Why *beta* instead of
some other distribution?  What other choices are there?  What criteria might
we apply?

------------------------------------------------------------------

ASYMPTOTICALLY it really doesn't matter what we choose, so long as
pi(theta) > 0 everywhere in the region: the Bernstein/von Mises ("Bayesian
central limit") theorem gives conditions under which the posterior distribution
of theta given n iid observations x1, ..., xn will be asymptotically Normal,
with mean theta-hat (the MLE) and covariance the inverse of the information
matrix, $(n\,I(\theta))^{-1}$.  The real interest lies with smaller samples,
and in the transition from tiny to moderately large samples.

-------------------------------------------------------------------

An easy option: CONJUGATE PRIORS
-------------

In a natural exponential family, the likelihood for a sample of size n is

    $$ f(x \mid \eta) = e^{\eta \cdot T(x) - n A(\eta)}\, h(x), $$

where $\eta\in\cE\subset\bbR^q$ is the 'natural parameter', where
$T(x) \equiv \sum T(x_j)$ is the 'natural sufficient statistic', and where
$A(\eta)$ is the normalizing constant; $h(x)$ is irrelevant.  For any
q-dimensional vector $s\in\bbR^q$ and number $m$, consider the prior
distribution

    $$ \pi(\eta \mid m, s) = c(m, s)\, e^{\eta \cdot s - m A(\eta)}; $$

with this prior, the posterior distribution would be proportional to

    $$ \pi(\eta \mid x) \propto e^{\eta \cdot [s + T(x)] - (m + n) A(\eta)}, $$

of the exact same form, now with $m^* = m + n$ and $s^* = s + T(x)$.  If the
family $\pi(\eta \mid m, s)$ is nice enough, this will lead to closed-form
solutions for Bayesian posterior expectations, probabilities, etc.

Warning: just because it's easy doesn't mean it's a good idea...

Interpretation: it is as if you had earlier observed an imaginary sample of
size m, whose sufficient statistic had average s/m.

============= Example =============

If we observe $X_j \sim \No(\mu, \sigma^2)$ with $\sigma^2$ known, then as a
function of the natural parameter $\mu$,

    $$ f(x \mid \mu) = c_0(x, \sigma^2)\, e^{-(n/2\sigma^2)(\mu - \bar X)^2}
                     = c_1\, e^{\mu\,(1/\sigma^2)\,T_1(x) - n\,(1/2\sigma^2)\,\mu^2}, $$

so a prior distribution of the form $\pi(\mu) = c\, e^{s\mu - m\mu^2}$ would be
'conjugate'... this is just a normal distribution, with arbitrary mean and
variance.  With a $\No(M, V)$ prior for $\mu$, the posterior is

    $$ \mu \mid X \sim \No\!\left( \frac{M/V + n\bar X/\sigma^2}{1/V + n/\sigma^2},\
       \frac{1}{1/V + n/\sigma^2} \right), $$

centered at the precision-weighted mean.  This is the same posterior as if you
had a FLAT prior distribution and had observed $m = \sigma^2/V$ earlier
observations with average value $M$.
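A minimal numerical sketch of the updating formula above (a Python
illustration; the function name normal_posterior and the particular numbers
are mine, not part of the notes):

    def normal_posterior(xbar, n, sigma2, M, V):
        """Posterior mean and variance of mu | X under the No(M, V) prior,
        with known sampling variance sigma2 (precision-weighted average)."""
        post_prec = 1.0 / V + n / sigma2    # posterior precision = prior precision + data precision
        post_mean = (M / V + n * xbar / sigma2) / post_prec
        return post_mean, 1.0 / post_prec

    # e.g. prior No(0, 1), sigma^2 = 1, n = 10 observations averaging 1.5:
    # returns (15/11, 1/11) ~= (1.364, 0.091), shrunk from Xbar = 1.5 toward M = 0.
    print(normal_posterior(xbar=1.5, n=10, sigma2=1.0, M=0.0, V=1.0))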
In particular, with $\sigma = 1$, $V = 1$, and $M = 0$, the posterior above
simplifies to

    $$ \mu \mid X \sim \No\!\left( \frac{n \bar X}{1 + n},\ \frac{1}{1 + n} \right); $$

with n = 1 this would have mean Xbar/2 and variance 1/2.  If Xbar is in the
tail of the prior distribution (say, Xbar > 4) this is just silly... the
posterior is completely inconsistent with BOTH the prior AND the likelihood.

---------------------------------------------------------------------------

Other conjugate pairs are easy to find (Box & Jenkins have all of them):

        X                        theta
        --------------------     ---------------------------------
        Bin, Bern, Neg Bin       p ~ Beta
        Normal                   mu ~ Normal, sig^{-2} ~ Gamma
        Poisson                  lam ~ Gamma
        Uniform                  a, b ~ "double Pareto"
        Multinomial              p ~ Dirichlet

Note that in most cases there is a limiting form of the conjugate distribution
that is flat (often improper), i.e. a "noninformative" prior.

---------------------------------------------------------------------------

What if the prior and likelihood disagree?
-----------------------

Which would you believe if they disagree wildly?  This is a sign that something
is amiss... the most common causes are

  - a misspecified likelihood function (probability model), e.g. outliers in
    data modeled as "normal";
  - an insufficiently thought-out prior distribution.

Usual advice: the prior should have "fatter tails" than the likelihood, so that
in case they disagree markedly, the likelihood "wins".  Jeffreys did not
recommend the "Jeffreys prior" $\pi(\mu) = 1$ for normal data; he recommended a
Cauchy prior, centered at 0.  If |Xbar| is large this behaves much like a flat
prior; if |Xbar| is small, it shrinks the estimate toward zero.

------------------------------------------------------------------------------

Admissibility
-----------------------

Compare the L2 risk functions of Xbar and of the Bayes estimates w.r.t. the
conjugate and Cauchy priors.  Discuss James-Stein estimates, Strawderman.
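As a rough illustration of that comparison, here is a Monte Carlo sketch in
Python; the settings (n = 10, a No(0,1) conjugate prior, a Cauchy(0,1) prior
handled by grid integration) and the helper names are illustrative assumptions,
not prescriptions from the notes:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 10, 1.0
    grid = np.linspace(-10.0, 10.0, 2001)   # integration grid for the Cauchy-prior posterior

    def cauchy_post_mean(xbar):
        """Posterior mean of mu given Xbar = xbar under a Cauchy(0,1) prior (numerical)."""
        lik = np.exp(-0.5 * n * (grid - xbar) ** 2 / sigma ** 2)  # likelihood of xbar, as a function of mu
        prior = 1.0 / (1.0 + grid ** 2)                           # Cauchy(0,1) density, up to a constant
        w = lik * prior
        return np.sum(grid * w) / np.sum(w)

    def mc_risk(mu, reps=2000):
        """Monte Carlo estimate of the squared-error risk of three estimators at this mu."""
        xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
        mle = xbar                                   # Xbar itself
        conj = n * xbar / (1.0 + n)                  # posterior mean under the No(0,1) conjugate prior
        cauchy = np.array([cauchy_post_mean(x) for x in xbar])
        return [np.mean((est - mu) ** 2) for est in (mle, conj, cauchy)]

    for mu in (0.0, 1.0, 4.0):
        r_mle, r_conj, r_cau = mc_risk(mu)
        print(f"mu={mu:3.1f}   Xbar: {r_mle:.4f}   conjugate: {r_conj:.4f}   Cauchy: {r_cau:.4f}")

Under these settings the conjugate-prior estimate should beat Xbar near mu = 0
but pay a price when mu is far from 0, while the Cauchy-prior estimate should
track Xbar closely for large |mu| and still gain a little near 0; that is the
pattern the admissibility discussion is after.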