STA250: Statistics, based loosely on DeGroot & Schervish
----------------------------------------
Week 5: More on Estimation, esp. in Exponential Families
------------------------------------------------
Definition: A statistical model { f(x | th) : th in Theta }
is an "Exponential Family" if there exists a
- function eta: Theta -> R^k, the "natural parameter";
- function T: X -> R^k, the "nat. sufficient stat.";
- function h: X -> R+, of no particular interest;
- function A: Theta -> R,
such that f(x | th) can be written in the form:
    f(x | th) = exp[ eta(th) . T(x) - A(th) ] * h(x)        (EF)
where eta(th) . T(x) denotes the inner product in R^k.
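For instance, the Poisson(th) pmf fits this form:
    f(x | th) = th^x e^{-th} / x!
              = exp[ x log(th) - th ] * (1/x!)
with eta(th) = log(th), T(x) = x, A(th) = th, and h(x) = 1/x!.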
----------------------------------------------------------------
If X.1 ... X.n are iid from the pdf in (EF), then the VECTOR
x := (X.1, ..., X.n) has (joint) pdf
f(x | th) = exp[ eta(th) \sum T(x.j) - n A(th) ] * \prod h(x.j)
which is ALSO of Exponential Family form, with the same natural
parameter eta() and with T, A closely related to those from the
individual components:
    T.n(x) = \sum T(x.j),        A.n(th) = n A(th)
SO, it's enough to consider a single (maybe vector-valued) observation
"x".
Claim: MLE is consistent in Expo Fams.
Build-up:
(a) MLE in Expo Fam
Assuming the LH achieves its maximum at an interior point of Theta
where the derivative exists and vanishes, the MLE "th.hat" will be
the solution "th" to:
T(x) = A'(th) / eta'(th) (*)
For samples of size n, this becomes simply
    T.bar(x) = A'(th) / eta'(th)
where T.bar(x) := (1/n) \sum T(x.j) is the average sufficient statistic.
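As a quick numerical sketch (the Poisson family and all names below are
my illustrative choices, not from these notes), one can solve this
equation with a root finder:

    import numpy as np
    from scipy.optimize import brentq

    # Poisson(th) written in form (EF): eta(th) = log(th), A(th) = th, T(x) = x
    def A_prime(th):
        return 1.0            # A'(th) = 1

    def eta_prime(th):
        return 1.0 / th       # eta'(th) = 1/th

    rng = np.random.default_rng(0)
    x = rng.poisson(lam=3.0, size=500)   # iid sample, true th = 3
    T_bar = x.mean()                     # T.bar(x), the sample mean

    # Solve T.bar(x) = A'(th)/eta'(th) for th; for Poisson the ratio is
    # just th, so the root finder reproduces the sample mean.
    th_hat = brentq(lambda th: A_prime(th) / eta_prime(th) - T_bar, 1e-8, 1e6)
    print(th_hat, T_bar)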
In the case of a multi-dimensional parameter th in R^k, by the chain
rule we have
0 = (d/dth.i) { sum eta.j(th) T.j(x) - A(th) }
= sum [(d eta.j/d th.i) T.j(x) ] - (d/dth.i) A(th)
so "th.hat" is the solution th to the matrix/vector equation
        H T(x) = A'
where H is the k-by-k matrix with entries H.ij = (d eta.j / d th.i) and
where A' is the gradient vector with entries (d/dth.i) A(th).
--------
(b) Mean of Sufficient Statistic
Since f(x | th) is a density, it integrates to one, and so:
1 = int_X exp[ eta(th) T(x) - A(th) ] * h(x) dx
Upon differentiating w.r.t. th under the integral sign, we have
0 = int_X {eta'(th) T(x) - A'(th) } *
exp[ eta(th) T(x) - A(th) ] * h(x) dx
= eta'(th) * E[ T(X) | th ] - A'(th),
so for fixed th and X ~ f(x | th) the expectation of T(X) is
E[ T(X) | th ] = A'(th) / eta'(th) (**)
Note the similarity (and differences) between (*) and (**).
Again, in k>1 dimensions a matrix/vector version holds.
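A Monte Carlo check of (**) for a concrete family (my own choice, the
Exponential(th) distribution; the decomposition below is standard):

    import numpy as np

    # f(x|th) = th * exp(-th x) = exp[ -th * x + log(th) ], so
    # eta(th) = -th, T(x) = x, A(th) = -log(th), h(x) = 1.
    th = 2.0
    A_prime = -1.0 / th          # A'(th)
    eta_prime = -1.0             # eta'(th)

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=1.0 / th, size=200_000)
    print(x.mean(), A_prime / eta_prime)   # both near E[T|th] = 1/th = 0.5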
--------
By the Law of Large Numbers, T.bar(x) -> E[ T | th ] as n->oo;
if A(th) and eta(th) have continuous derivatives and the ratio
A'(th) / eta'(th) is one-to-one (always the case for us), then the
solution th.hat of (*) must converge to the solution th of (**),
i.e., the MLE must converge to the true value of th.
WHAT IT MEANS:
If we observe repeated independent observations X.j ~ f(x | th),
all from the same distribution (i.e. for some fixed value of th),
then in the limit as n->oo we will learn th perfectly from the
data ---
             th  =  lim { th.hat.n }
                   n->oo
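A small simulation illustrates this (illustrative values; Bernoulli
chosen for concreteness, where th.hat.n = T.bar(x) = x.bar):

    import numpy as np

    rng = np.random.default_rng(6)
    p_true = 0.3
    for n in (10, 100, 1000, 10_000):
        x = rng.binomial(1, p_true, size=n)
        print(n, abs(x.mean() - p_true))   # estimation error shrinks with n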
In fact more is true--- not only does the estimation error
[ th.hat - th ]
go to zero as n->oo, it becomes approximately normally distributed,
with mean zero and variance 1 / (n A''(th)) (here th is the natural
parameter, so that A''(th) is the Fisher information).... so
sqrt{ n A''(th) } * [ th.hat - th ]
has a standard normal limiting distribution as n grows. The
term "A''(th)" can be replaced with its estimate "A''(th.hat)"
if we like, leading to asymptotic interval estimates for th, e.g.,
0.95 ~= Pr[ th.hat - 1.96/sqrt{n A''(th.hat)} < th
                     < th.hat + 1.96/sqrt{n A''(th.hat)} ]
for *any* exponential family! Compare this to the formulas we get
for Binomial data (both natural and conventional parameterizations),
and to the exact result for normal distributions with known variance.
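Here is a sketch of a coverage check for this interval (my example: the
Poisson family in its natural parameterization eta = log(lambda), where
A(eta) = exp(eta), so A''(eta.hat) = x.bar):

    import numpy as np

    rng = np.random.default_rng(2)
    lam, n, reps = 3.0, 200, 2000
    eta_true = np.log(lam)                 # natural parameter of Poisson

    hits = 0
    for _ in range(reps):
        x = rng.poisson(lam, size=n)
        xbar = x.mean()
        eta_hat = np.log(xbar)             # MLE of the natural parameter
        half = 1.96 / np.sqrt(n * xbar)    # 1.96 / sqrt(n A''(eta.hat))
        hits += (eta_hat - half < eta_true < eta_hat + half)

    print(hits / reps)                     # should land near 0.95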
-----------------------------------------------------------------
EXAMPLES:
Work out details for n Bernoulli variables, in both conventional
and natural parametrizations.
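One possible worked sketch (symbolic; the decompositions below are
standard, the code organization and names are mine):

    import sympy as sp

    p = sp.symbols('p', positive=True)       # conventional parameter
    theta = sp.symbols('theta', real=True)   # natural parameter (log-odds)

    # Conventional parameterization:
    #   f(x|p) = p^x (1-p)^(1-x) = exp[ x log(p/(1-p)) + log(1-p) ]
    # so eta(p) = log(p/(1-p)), T(x) = x, A(p) = -log(1-p), h(x) = 1.
    eta_p = sp.log(p / (1 - p))
    A_p = -sp.log(1 - p)
    print(sp.simplify(sp.diff(A_p, p) / sp.diff(eta_p, p)))  # E[T|p] = p, by (**)

    # Natural parameterization theta = log(p/(1-p)), A(theta) = log(1+e^theta):
    A_th = sp.log(1 + sp.exp(theta))
    print(sp.simplify(sp.diff(A_th, theta)))     # mean of T: e^th/(1+e^th) = p
    print(sp.simplify(sp.diff(A_th, theta, 2)))  # A''(theta) = p(1-p)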
=================================================================
(c) Conjugate Prior Distributions
If
f(x | th) = exp[ eta(th) T(x) - A(th) ] * h(x)
and the prior density for th happens to be of the form
pi( th ) = c(th) * exp[ alp eta(th) - bet A(th) ] (#)
then the posterior density is proportional to
    pi( th | X.n ) ~ c(th) *
           exp[ {alp+sum T(x.j)} eta(th) - {bet+n} A(th) ]
so it's again of form (#) but with new parameters,
alp* = alp + sum T(x.j)
bet* = bet + n
One can interpret "bet" as a "prior sample size" and "alp/bet"
as the "prior average T(x.j)". Of course, we need for
int { c(th) exp[ {alp+sum T(x.j)} eta(th) - {bet+n} A(th) ] } dth
to be positive and finite--- otherwise the posterior is "improper"
and not suitable for inference. It's okay if the PRIOR is improper
but not the POSTERIOR.
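A minimal sketch of this update for Bernoulli data (illustrative prior
values; with c(th) = 1, eta(p) = log(p/(1-p)), and A(p) = -log(1-p),
the prior (#) has kernel p^alp (1-p)^(bet-alp), a Beta density in p):

    import numpy as np

    def update(alp, bet, x):
        # alp* = alp + sum T(x.j), bet* = bet + n, with T(x) = x here
        return alp + x.sum(), bet + len(x)

    rng = np.random.default_rng(5)
    x = rng.binomial(1, 0.3, size=50)
    alp_star, bet_star = update(alp=1.0, bet=2.0, x=x)
    print(alp_star, bet_star)
    print(alp_star / bet_star, x.mean())   # updated "average T" vs. x.bar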
In the limit as alp->0 and bet->0 this commonly approaches the
"Jeffreys Prior" density proportional to
pi( th ) ~ sqrt { A''(th) }
which has the nice property that it's invariant under changes of
variables: if we reparameterize to phi = g(th) with g smooth and
one-to-one, computing the Jeffreys prior directly in terms of phi
gives the same distribution as carrying the Jeffreys prior for th
across by the usual change-of-variables formula.
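A quick symbolic illustration of that invariance (my example, again the
Bernoulli family): start from sqrt{A''(theta)} in the natural
parameterization and change variables to p.

    import sympy as sp

    theta = sp.symbols('theta', real=True)
    p = sp.symbols('p', positive=True)

    A = sp.log(1 + sp.exp(theta))                    # Bernoulli, natural param
    jeffreys_theta = sp.sqrt(sp.diff(A, theta, 2))   # pi(theta) ~ sqrt(A''(theta))

    # Substitute theta = log(p/(1-p)) and multiply by |dtheta/dp| = 1/(p(1-p)):
    jeffreys_p = sp.simplify(
        jeffreys_theta.subs(theta, sp.log(p / (1 - p))) / (p * (1 - p)))
    print(jeffreys_p)   # 1/sqrt(p(1-p)): the Beta(1/2, 1/2) kernel, the same
                        # answer as sqrt of the Fisher information computed
                        # directly in the conventional parameter p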
---------------------------------------------------------------------