More on Estimation, esp. in Exponential Families
------------------------------------------------


If 
      f(x | th) = exp[ eta(th) T(x) - A(th) ] * h(x)

for each X_i among n indep. ident. dist'd RV's X1 ... Xn,
then the JOINT pdf is also of Exponential Family form

 f(x | th) = exp[ eta(th) \sum T(x.j) - n A(th) ] * \prod h(x.j)

SO, it's enough to consider a single (maybe vector-valued)
observation "x".

Claim: MLE is consistent in Expo Fams.

Build-up:

(a) MLE in Expo Fam


Assuming the LH achieves its maximum at an interior point of Theta
where the derivative exists and vanishes, the MLE "th.hat" will be
the solution "th" to:

                  T(x) = A'(th) / eta'(th)      (*)

For samples of size n, dividing the likelihood equation by n gives

              T.bar(x) = A'(th) / eta'(th)

where T.bar(x) = (1/n) sum T(x.j) is the sample average of T.
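
As a rough illustration, the scalar likelihood equation can be solved
numerically.  The sketch below assumes Bernoulli data in the
conventional parametrization th = p, where eta(p) = log(p/(1-p)),
A(p) = -log(1-p) and T(x) = x.  The data, the function names and the
choice of bisection are illustrative assumptions, not part of these
notes.

```python
import math

def mean_T(th):
    """A'(th) / eta'(th) for Bernoulli with th = p (conventional parametrization).

    eta(p) = log(p/(1-p))  =>  eta'(p) = 1 / (p (1-p))
    A(p)   = -log(1-p)     =>  A'(p)   = 1 / (1-p)
    """
    eta_prime = 1.0 / (th * (1.0 - th))
    A_prime = 1.0 / (1.0 - th)
    return A_prime / eta_prime

def solve_mle(t_bar, ratio, lo, hi, tol=1e-12):
    """Solve ratio(th) = t_bar by bisection, assuming ratio is increasing on (lo, hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio(mid) < t_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]       # hypothetical data, n = 10
t_bar = sum(x) / len(x)                   # sample average of T(x.j) = x.j
th_hat = solve_mle(t_bar, mean_T, 1e-9, 1.0 - 1e-9)
print(th_hat)                             # close to t_bar = 0.3
```

For Bernoulli data the ratio A'(p)/eta'(p) reduces to p, so the solver
just recovers the sample mean; the same solver should work for any
scalar family where the ratio is monotone in th.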

In the case of a multi-dimensional parameter th in R^p, by the chain
rule we have

        0 = (d/dth.i) { sum eta.j(th) T.j(x) - A(th) }
          = sum [(d eta.j/d th.i) T.j(x) ] - (d/dth.i) A(th)

so "th.hat" is the solution th to the matrix/vector equation

            H T(x) = A'

where H is the matrix with entries H.ij = (d eta.j/d th.i), T(x) is
the vector with entries T.j(x), and A' is the gradient vector with
entries (d/dth.i) A(th).

--------

(b) Mean of Sufficient Statistic

Since f(x  | th) is a density, it integrates to one, and so:

   1 =  int_X   exp[ eta(th) T(x) - A(th) ] * h(x)   dx

Upon taking a derivative w.r.t. th, we have

   0 =  int_X {eta'(th) T(x) - A'(th) }  *
              exp[ eta(th) T(x) - A(th) ] * h(x)   dx

     =  eta'(th) * E[ T(X) | th ] - A'(th),

so for fixed th and X ~ f(x | th) the expectation of T(X) is

     E[ T(X) | th ]  =  A'(th) / eta'(th)            (**)

Note the similarity (and differences) between (*) and (**).
Again, in p>1 dimensions a matrix/vector version holds.
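
Identity (**) can be checked numerically.  The sketch below assumes a
Poisson family with th equal to the mean, so that eta(th) = log(th),
A(th) = th, T(x) = x and h(x) = 1/x!.  The value of th, the
finite-difference step and the truncation point of the sum are all
arbitrary illustrative choices.

```python
import math

lam = 3.5  # hypothetical Poisson mean, used as the parameter th

def eta(th):
    return math.log(th)      # natural parameter of Poisson(th)

def A(th):
    return th                # log-normalizer written in terms of th

def deriv(f, th, h=1e-5):
    """Central finite-difference approximation to f'(th)."""
    return (f(th + h) - f(th - h)) / (2.0 * h)

def pmf(x, th):
    """exp[eta(th) T(x) - A(th)] * h(x) with T(x) = x and h(x) = 1/x!."""
    return math.exp(eta(th) * x - A(th) - math.lgamma(x + 1))

# Left side of (**): E[T(X) | th], summing far enough into the tail
# that the remaining mass is negligible for this value of lam.
mean_T = sum(x * pmf(x, lam) for x in range(200))

# Right side of (**): A'(th) / eta'(th)
ratio = deriv(A, lam) / deriv(eta, lam)

print(mean_T, ratio)         # both should be close to lam
```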

--------

By the Law of Large Numbers, T.bar(x) -> E[ T | th ] as n->oo;
if A(th) and eta(th) have continuous derivatives (always the
case for us) and the map th -> A'(th) / eta'(th) has a continuous
inverse near the true value, then the solution th.hat of (*) must
converge to the solution th of (**), i.e., the MLE must converge
to the true value of th.

WHAT IT MEANS:

If we observe repeated independent observations X.j ~ f(x | th),
all from the same distribution (i.e. for some fixed value of th),
then in the limit as n->oo we will learn th perfectly from the
data:
          th = lim  { th.hat.n }
              n->oo
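
A small simulation can make the convergence concrete.  This is only a
sketch: it assumes Bernoulli data with a made-up true value p = 0.3,
a fixed random seed, and three arbitrary sample sizes.

```python
import random

random.seed(0)          # fixed seed so the run is reproducible
p_true = 0.3            # hypothetical true parameter

def mle(n):
    """MLE of p from n simulated Bernoulli(p_true) draws.

    For Bernoulli data A'(p)/eta'(p) = p, so solving the likelihood
    equation gives p.hat = T.bar = the sample mean.
    """
    successes = sum(1 for _ in range(n) if random.random() < p_true)
    return successes / n

errors = {n: abs(mle(n) - p_true) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(n, err)       # the error typically shrinks like 1/sqrt(n)
```

Any single run is random, so the errors need not decrease at every
step, but for large n they should be small with high probability.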

In fact more is true---  not only does the estimation error

          [ th.hat - th ]

go to zero as n->oo, it becomes approximately normally-distributed,
with mean zero and variance 1 / [n I(th)], where I(th) is the Fisher
information in one observation.  In the natural parametrization
eta(th) = th we have I(th) = A''(th), so

     sqrt{ n A''(th) } * [ th.hat - th ]

has a standard normal limiting distribution as n grows.  The
term "A''(th)" can be replaced with its estimate "A''(th.hat)"
if we like, leading to asymptotic interval estimates for th, e.g.,

0.95 = Pr[  th.hat - 1.96/sqrt{n A''(th.hat)} < th < th.hat + ... ]

for *any* exponential family in natural form!  For a general eta(th),
use I(th) = A''(th) - eta''(th) A'(th) / eta'(th) in place of A''(th).
Compare this to the formulas we get
for Binomial data (both natural and conventional parameterizations),
and to the exact result for normal distributions with known variance.
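
For Bernoulli data the comparison can be sketched in a few lines.  The
code assumes the natural parametrization th = log(p/(1-p)) with
A(th) = log(1 + e^th), so A''(th) = p(1-p), and sets it beside the
usual large-sample interval on the p scale.  The counts and the
function names are made up for illustration.

```python
import math

def natural_interval(successes, n, z=1.96):
    """Asymptotic interval for th = log(p/(1-p)), using A(th) = log(1 + e^th).

    Here A''(th) = p (1 - p), with p = e^th / (1 + e^th).
    Requires 0 < successes < n so that th.hat is finite.
    """
    p_hat = successes / n
    th_hat = math.log(p_hat / (1.0 - p_hat))
    half = z / math.sqrt(n * p_hat * (1.0 - p_hat))
    return th_hat - half, th_hat + half

def conventional_interval(successes, n, z=1.96):
    """Familiar large-sample interval for p: p.hat +/- z sqrt(p.hat (1-p.hat) / n)."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

def expit(th):
    """Inverse of the logit map: p = 1 / (1 + e^(-th))."""
    return 1.0 / (1.0 + math.exp(-th))

n, s = 100, 30                              # hypothetical data
lo_th, hi_th = natural_interval(s, n)
print(expit(lo_th), expit(hi_th))           # natural interval mapped back to the p scale
print(conventional_interval(s, n))          # interval built directly on the p scale
```

The two intervals differ in finite samples: the one built on the th
scale always maps back inside (0, 1) and is not symmetric about p.hat,
while the one built on the p scale is symmetric by construction.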

-----------------------------------------------------------------

EXAMPLES:

Work out details for n Bernoulli variables, in both conventional
and natural parametrizations.

=================================================================

(c) Conjugate Prior Distributions

If
      f(x | th) = exp[ eta(th) T(x) - A(th) ] * h(x)

and the prior density for th happens to be of the form

     pi( th ) = c * exp[ alp eta(th) - bet A(th) ]   (#)

then the posterior density must be

     pi( th | X.n) = c*  *
          exp[ {alp+sum T(x.j)} eta(th) - {bet+n} A(th) ] 

so it's again of form (#) but with new parameters,

     alp*  =  alp + sum T(x.j)
     bet*  =  bet + n

One can interpret "bet" as a "prior sample size" and "alp/bet"
as the "prior average T(x.j)".  Of course, we need for the
normalizing integral

      1 / c*(alp,bet,x) =
      int { exp[ {alp+sum T(x.j)} eta(th) - {bet+n} A(th) ] } dth

to be positive and finite--- otherwise the posterior is "improper"
and not suitable for inference.  It's okay if the PRIOR is improper
but not the POSTERIOR.
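
The update rule is easy to sketch in code.  The example below assumes
Bernoulli data in the natural parametrization (eta(th) = th,
A(th) = log(1 + e^th), T(x) = x); the function names, the data and the
prior values are hypothetical.

```python
def conjugate_update(alp, bet, t_values):
    """Return (alp*, bet*) = (alp + sum T(x.j), bet + n)."""
    return alp + sum(t_values), bet + len(t_values)

def bernoulli_posterior_is_proper(alp, bet):
    """Properness check for Bernoulli data in the natural parametrization.

    With eta(th) = th and A(th) = log(1 + e^th), the density
    exp[alp th - bet A(th)] integrates over the real line only when
    0 < alp < bet (on the p scale it is a Beta(alp, bet - alp) density).
    """
    return 0 < alp < bet

x = [1, 0, 1, 1]                       # hypothetical Bernoulli data, T(x) = x
alp_star, bet_star = conjugate_update(1.0, 2.0, x)
print(alp_star, bet_star)              # 4.0 6.0
print(bernoulli_posterior_is_proper(alp_star, bet_star))

# An improper prior (alp = bet = 0) can still give a proper posterior,
# as long as the data contain at least one success and one failure.
print(bernoulli_posterior_is_proper(*conjugate_update(0.0, 0.0, x)))
```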

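In the limit as alp->0 and bet->0 the prior (#) becomes flat in th.
In some families (e.g. a normal mean with known variance) this
coincides with the "Jeffreys Prior", whose density in the natural
parametrization eta(th) = th is proportional to

                  pi( th ) ~ sqrt { A''(th) }

(in general, to sqrt{ I(th) } for the Fisher information I(th)).
This prior has the nice property that it's invariant under changes
of variables: computing it directly in a new parametrization
phi = g(th) gives the same distribution as transforming the prior
for th by the usual change-of-variables formula.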
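
The invariance can be checked numerically in a simple case.  The
sketch below assumes Bernoulli data and compares two routes to a
Jeffreys density on the p scale: computing sqrt{A''(th)} on the
natural scale th = log(p/(1-p)) and transforming it to p, versus
computing the square root of the Fisher information 1/(p(1-p))
directly in p.  Function names and test points are illustrative.

```python
import math

def jeffreys_natural(th):
    """sqrt{A''(th)} for Bernoulli with A(th) = log(1 + e^th)."""
    p = 1.0 / (1.0 + math.exp(-th))
    return math.sqrt(p * (1.0 - p))

def jeffreys_conventional(p):
    """sqrt{I(p)} with Fisher information I(p) = 1 / (p (1 - p))."""
    return 1.0 / math.sqrt(p * (1.0 - p))

def transformed_density(p):
    """Natural-scale Jeffreys density pushed to the p scale.

    Change of variables th = log(p/(1-p)), with |d th / d p| = 1 / (p (1-p)).
    """
    th = math.log(p / (1.0 - p))
    return jeffreys_natural(th) / (p * (1.0 - p))

for p in (0.1, 0.5, 0.9):
    print(p, transformed_density(p), jeffreys_conventional(p))  # the two columns agree
```

Both densities are unnormalized, so agreement up to a constant factor
is all that invariance requires; here the constant happens to be 1.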

---------------------------------------------------------------------