Where do priors come from?

In a statistical problem where we observe some quantity X whose distribution
depends on an uncertain parameter theta, we regard the likelihood function as a
conditional density function f(x | theta) for X *given theta*; to construct the
joint probability distribution for both X and theta, we need a marginal
distribution for theta, i.e., a "prior" pi(theta), to complete the
specification pi(x, theta) = f(x | theta) * pi(theta).

-----------------------------------------------------------------

Often the prior distribution pi(theta) is chosen to reflect earlier evidence
about theta, usually rather vague evidence--- for example, past experience with
experimental drugs treating cold symptoms may suggest that about 20% of them
show enough improvement over standard treatments to be worth pursuing---
suggesting perhaps a Be(1,4) or Be(2,8) distribution for the uncertain
probability p that a new drug shows such improvement.  Why *beta* instead of
some other distribution?  What other choices are there?  What criteria might
we apply?

------------------------------------------------------------------

ASYMPTOTICALLY it really doesn't matter what we choose, so long as
pi(theta) > 0 everywhere in the region: the Bernstein/von Mises ("Bayesian
central limit") theorem gives conditions under which the posterior distribution
of theta given n iid observations x1, ..., xn will be asymptotically Normal,
with mean theta-hat (the MLE) and covariance the inverse of the information
matrix, $(n\,I(\theta))^{-1}$.  The real interest lies with smaller samples,
and in the transition from tiny to moderately large samples.

-------------------------------------------------------------------

An easy option: CONJUGATE PRIORS
-------------

In a natural exponential family, the likelihood for a sample of size n is

    $$ f(x \mid \eta) = e^{\eta \cdot T(x) - n A(\eta)}\, h(x), $$

where $\eta\in\cE\subset\bbR^q$ is the 'natural parameter', where
$T(x) \equiv \sum T(x_j)$ is the 'natural sufficient statistic', and where
$A(\eta)$ is the normalizing constant; $h(x)$ is irrelevant.  For any
q-dimensional vector $s\in\bbR^q$ and number $m$, consider the prior
distribution

    $$ \pi(\eta \mid m, s) = c(m, s)\, e^{\eta \cdot s - m A(\eta)}; $$

with this prior, the posterior distribution would be proportional to

    $$ \pi(\eta \mid x) \propto e^{\eta \cdot [s + T(x)] - (m + n) A(\eta)}, $$

of the exact same form, now with $m^* = m + n$ and $s^* = s + T(x)$.  If the
family $\pi(\eta \mid m, s)$ is nice enough, this will lead to closed-form
solutions for Bayesian posterior expectations, probabilities, etc.

Warning: just because it's easy doesn't mean it's a good idea...

Interpretation: it is as if you had earlier observed an imaginary sample of
size m, whose sufficient statistic had average s/m.

============= Example =============

If we observe $X_j \sim \No(\mu, \sigma^2)$ with $\sigma^2$ known, then as a
function of the natural parameter $\mu$,

    $$ f(x \mid \mu) = c_0(x, \sigma^2)\, e^{-(n/2\sigma^2)(\mu - \bar X)^2}
                     = c_1\, e^{\mu\,(1/\sigma^2)\,T_1(x) - n\,(1/2\sigma^2)\,\mu^2}, $$

so a prior distribution of the form $\pi(\mu) = c\, e^{s\mu - m\mu^2}$ would be
'conjugate'... this is just a normal distribution, with arbitrary mean and
variance.  With a $\No(M, V)$ prior for $\mu$, the posterior is

    $$ \mu \mid X \sim \No\!\left( \frac{M/V + n\bar X/\sigma^2}{1/V + n/\sigma^2},\
       \frac{1}{1/V + n/\sigma^2} \right), $$

centered at the precision-weighted mean.  This is the same posterior as if you
had a FLAT prior distribution and had observed $m = \sigma^2/V$ earlier
observations with average value $M$.
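A minimal numerical sketch of the updating formula above (a Python
illustration; the function name normal_posterior and the particular numbers
are mine, not part of the notes):

    def normal_posterior(xbar, n, sigma2, M, V):
        """Posterior mean and variance of mu | X under the No(M, V) prior,
        with known sampling variance sigma2 (precision-weighted average)."""
        post_prec = 1.0 / V + n / sigma2    # posterior precision = prior precision + data precision
        post_mean = (M / V + n * xbar / sigma2) / post_prec
        return post_mean, 1.0 / post_prec

    # e.g. prior No(0, 1), sigma^2 = 1, n = 10 observations averaging 1.5:
    # returns (15/11, 1/11) ~= (1.364, 0.091), shrunk from Xbar = 1.5 toward M = 0.
    print(normal_posterior(xbar=1.5, n=10, sigma2=1.0, M=0.0, V=1.0))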
In particular, with $\sigma = 1$, $V = 1$, and $M = 0$, the posterior above
simplifies to

    $$ \mu \mid X \sim \No\!\left( \frac{n \bar X}{1 + n},\ \frac{1}{1 + n} \right); $$

with n = 1 this would have mean Xbar/2 and variance 1/2.  If Xbar is in the
tail of the prior distribution (say, Xbar > 4) this is just silly... the
posterior is completely inconsistent with BOTH the prior AND the likelihood.

---------------------------------------------------------------------------

Other conjugate pairs are easy to find (Box & Jenkins have all of them):

        X                        theta
        --------------------     ---------------------------------
        Bin, Bern, Neg Bin       p ~ Beta
        Normal                   mu ~ Normal, sig^{-2} ~ Gamma
        Poisson                  lam ~ Gamma
        Uniform                  a, b ~ "double Pareto"
        Multinomial              p ~ Dirichlet

Note that in most cases there is a limiting form of the conjugate distribution
that is flat (often improper), i.e. a "noninformative" prior.

---------------------------------------------------------------------------

What if the prior and likelihood disagree?
-----------------------

Which would you believe if they disagree wildly?  This is a sign that something
is amiss... the most common causes are

  - a misspecified likelihood function (probability model), e.g. outliers in
    data modeled as "normal";
  - an insufficiently thought-out prior distribution.

Usual advice: the prior should have "fatter tails" than the likelihood, so that
in case they disagree markedly, the likelihood "wins".  Jeffreys did not
recommend the "Jeffreys prior" $\pi(\mu) = 1$ for normal data; he recommended a
Cauchy prior, centered at 0.  If |Xbar| is large this behaves much like a flat
prior; if |Xbar| is small, it shrinks the estimate toward zero.

------------------------------------------------------------------------------

Admissibility
-----------------------

Compare the L2 risk functions of Xbar and of the Bayes estimates w.r.t. the
conjugate and Cauchy priors.  Discuss James-Stein estimates, Strawderman.
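As a rough illustration of that comparison, here is a Monte Carlo sketch in
Python; the settings (n = 10, a No(0,1) conjugate prior, a Cauchy(0,1) prior
handled by grid integration) and the helper names are illustrative assumptions,
not prescriptions from the notes:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 10, 1.0
    grid = np.linspace(-10.0, 10.0, 2001)   # integration grid for the Cauchy-prior posterior

    def cauchy_post_mean(xbar):
        """Posterior mean of mu given Xbar = xbar under a Cauchy(0,1) prior (numerical)."""
        lik = np.exp(-0.5 * n * (grid - xbar) ** 2 / sigma ** 2)  # likelihood of xbar, as a function of mu
        prior = 1.0 / (1.0 + grid ** 2)                           # Cauchy(0,1) density, up to a constant
        w = lik * prior
        return np.sum(grid * w) / np.sum(w)

    def mc_risk(mu, reps=2000):
        """Monte Carlo estimate of the squared-error risk of three estimators at this mu."""
        xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
        mle = xbar                                   # Xbar itself
        conj = n * xbar / (1.0 + n)                  # posterior mean under the No(0,1) conjugate prior
        cauchy = np.array([cauchy_post_mean(x) for x in xbar])
        return [np.mean((est - mu) ** 2) for est in (mle, conj, cauchy)]

    for mu in (0.0, 1.0, 4.0):
        r_mle, r_conj, r_cau = mc_risk(mu)
        print(f"mu={mu:3.1f}   Xbar: {r_mle:.4f}   conjugate: {r_conj:.4f}   Cauchy: {r_cau:.4f}")

Under these settings the conjugate-prior estimate should beat Xbar near mu = 0
but pay a price when mu is far from 0, while the Cauchy-prior estimate should
track Xbar closely for large |mu| and still gain a little near 0; that is the
pattern the admissibility discussion is after.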