STA250: Statistics, based loosely on DeGroot & Schervish
----------------------------------------
Week 3: Estimation
7.1 Statistical Inference:
- Parameters: \Theta (dim: k)(D+S use \Omega; sheesh)
- Observations: \cX (dim: n)
- PDF/PF: f(x|\theta)
Tasks:
- Prediction: What will x_{n+1} (probably) be?
- Estimation: What is theta?
- Testing: Is "theta=0" true or false? How about "theta <= 1/2"?
- Decisions: Choose action a\in\cA to minimize loss L(\theta,a)
(loss depends on theta, but decision is based on X)
- Design: Choose rv's Y to help learn about \theta
7.2 Prior and Posterior Distributions:
- Likelihood: View f(x|\theta) as COND'L distribution; need marg'l!
- Prior: \pi(\theta), discrete "pf/pmf" or continuous "pdf" (\xi=\pi)
- BAYES THEOREM
- Posterior: \pi(\theta | \bf x)
= f_n(x | \theta)\pi(\theta) / g_n(x),
\propto f_n(x | \theta)\pi(\theta) [as function of \theta]
f_n(x|\theta) = f(x_1|\theta)...f(x_n|\theta) (joint pdf)
g_n(x) = \int f_n(x | \theta) \pi(\theta) d\theta (marginal for x)
Examples: 2 succ in 10 Binomial tries: 45 p^2 (1-p)^8
8 failures before 2nd succ: (x+alp-1:x) p^alp q^x
= 9 p^2 (1-p)^8
p ~ Un(0,1) ==> p|x ~ Be(3,9); E[p|x] = 1/4, P[p < 1/2 | x] = 0.9673
- LH: Likelihood Function is *any* multiple of f(x|\theta) at observed x=X
- Sequential Observations:
"Prior" at stage n+1 is "Posterior" from stage n
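A quick numeric check of the Beta posterior example above (uniform prior, 2 successes in 8+2 Bernoulli trials), using only the standard library; for integer shape parameters the CDF follows from the identity I_x(a,b) = P[Binomial(a+b-1, x) >= a].

```python
from math import comb

# Posterior from a Un(0,1) prior + 2 successes, 8 failures: Be(3, 9)
a, b = 3, 9

# Posterior mean of Beta(a, b)
mean = a / (a + b)                    # 3/12 = 0.25

def beta_cdf(x, a, b):
    """Regularized incomplete beta I_x(a,b) for integer a, b,
    via I_x(a,b) = P[Binomial(a+b-1, x) >= a]."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

print(mean)                           # 0.25
print(round(beta_cdf(0.5, a, b), 4))  # P[p < 1/2 | x] = 0.9673
```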
=========================================================================
7.3 Conjugate Prior Distributions:
- Bernoulli sampling: Beta prior
- Poisson sampling: Gamma prior
- Normal sampling: Normal prior
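A minimal sketch of one of these conjugate pairs, Poisson sampling with a Gamma prior (shape/rate convention assumed): with lambda ~ Ga(a, b) and data x_1..x_n, the posterior is Ga(a + sum x_i, b + n).

```python
def gamma_poisson_update(a, b, xs):
    """Conjugate update: Ga(a, b) prior (shape a, rate b) + Poisson data xs
    -> Ga(a + sum(xs), b + len(xs)) posterior."""
    return a + sum(xs), b + len(xs)

# Hypothetical data: 5 Poisson counts; posterior mean is a_post / b_post
a_post, b_post = gamma_poisson_update(2.0, 1.0, [3, 0, 2, 4, 1])
print(a_post, b_post)   # 12.0 6.0  -> posterior mean 12/6 = 2.0
```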
Improper Distributions
Exponential Families:
General: f(x | th) = h(x) exp{ eta(theta) . T(x) - B( th ) }
Natural: f(x | th) = h(x) exp{ eta . T(x) - A( eta ) }
      Mean is E[T(X)] = J^{-1} \nabla B    (J_ij = \partial \eta_j / \partial \theta_i)
                      = \nabla A in "natural" coordinates; also
Cov is V[T(X)] = \nabla^2 A (Hessian matrix of 2nd partial derivs)
MLE is sol'n to: T(X) = J^{-1} \nabla B (= \nabla A, in natural coords)
Simple random samples: T -> \sum T; B -> n*B, so MLE satisfies
J^{-1}(hat th) \nabla B(hat th) = (1/n) \sum T(x_j)
      while the mean of T at the true parameter value th* satisfies
        J^{-1}(th*) \nabla B(th*) = E[T(X)]
so, by law of large numbers, hat th -> th* as n-> infty, ie,
CONSISTENCY.
      In fact, for large n, hat th is approx No( th*, I(th*)^{-1}/n ), where
      I is the Fisher information (= \nabla^2 A = V[T(X)] in natural coords),
      so the MLE is asymptotically normally distributed.
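A simulation sketch of the consistency claim, under an assumed Bernoulli(p) setup where T(x) = x and the MLE is the sample mean of T; as n grows, hat p settles toward the true p by the law of large numbers.

```python
import random

random.seed(0)          # fixed seed so the run is reproducible
p_true = 0.3

# MLE for Bernoulli(p) is the sample mean (1/n) sum T(x_j) = x-bar
for n in (100, 10_000, 1_000_000):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    print(n, sum(xs) / n)
```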
7.4 Bayes Estimators (Decision Theory):
Loss: L(\theta, a)
DecF: \delta(x) \in \cA
Risk: R(\th, \del) = E[ L(\th, \del(X)) | \th ]
= \int L(\th,\del(x)) f(x|\th) dx
ExpL: E[ L(\th, a) | \bx ] = \int_\Omega L(\th,a) \pi(\th|x)d\th
BayR: R(\pi, \del) = E[ L(\th, \del(X)) ]  (expectation over both \th and X)
= \iint L(\th,\del(x)) f(x|\th) dx \pi(\th)d\th
Bayes Estimator: \del^* chosen to minimize BayR.
E.G.: L(\th, a) = |\th-a|^2: Bayes estimate is the posterior MEAN (e.g., Normal or Poisson sampling)
E.G.: L(\th, a) = |\th-a|: Bayes estimate is the posterior MEDIAN (e.g., Exponential sampling)
- Large Samples: DeG example: Bernoulli, n=100, y=10; Be(1,2)/Un priors.
- Consistency: Asymptotically, \del^*(X) ~ No(\th, c/n) for small c
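A sketch of the DeGroot example above under both losses, assuming the Un(0,1) prior: n=100 trials with y=10 successes give posterior Be(11, 91), so the squared-error Bayes estimate is the posterior mean 11/102, and the absolute-error estimate is the posterior median, found here by bisection on the integer-parameter Beta CDF.

```python
from math import comb

# DeGroot example: 100 Bernoulli trials, 10 successes, Un(0,1) prior -> Be(11, 91)
a, b = 11, 91

# Squared-error loss: Bayes estimate = posterior mean
post_mean = a / (a + b)    # 11/102 ~= 0.1078

def beta_cdf(x, a, b):
    """I_x(a,b) for integer a, b via I_x(a,b) = P[Binomial(a+b-1, x) >= a]."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

# Absolute-error loss: Bayes estimate = posterior median (bisect the CDF at 1/2)
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if beta_cdf(mid, a, b) < 0.5:
        lo = mid
    else:
        hi = mid
post_median = (lo + hi) / 2
print(round(post_mean, 4), round(post_median, 4))
```

The two estimates are close but not equal (the Be(11,91) posterior is right-skewed, so the median sits slightly below the mean).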
=========================================================================