STA250: Statistics, based loosely on DeGroot & Schervish
----------------------------------------
Week 3: Estimation

7.1 Statistical Inference:
 - Parameters: \theta \in \Theta (dim: k)   (D&S use \Omega for the parameter space; sheesh)
 - Observations: X \in \cX (dim: n)
 - PDF/PF: f(x|\theta)
 Tasks:
 - Prediction: What will x_{n+1} (probably) be?
 - Estimation: What is \theta?
 - Testing: Is "\theta = 0" true or false?  How about "\theta <= 1/2"?
 - Decisions: Choose action a \in \cA to minimize loss L(\theta, a)
   (the loss depends on \theta, but the decision is based on X)
 - Design: Choose rv's Y to help learn about \theta

7.2 Prior and Posterior Distributions:
 - Likelihood: View f(x|\theta) as a CONDITIONAL distribution; we'll need a marginal!
 - Prior: \pi(\theta), a discrete pf/pmf or a continuous pdf   (D&S write \xi where we write \pi)
 - BAYES' THEOREM
 - Posterior: \pi(\theta | x) = f_n(x|\theta) \pi(\theta) / g_n(x)
                              \propto f_n(x|\theta) \pi(\theta)   [as a function of \theta]
   f_n(x|\theta) = f(x_1|\theta) ... f(x_n|\theta)               (joint pdf)
   g_n(x) = \int f_n(x|\theta) \pi(\theta) d\theta               (marginal for x)
 Examples:
   2 successes in 10 Binomial tries:   C(10,2) p^2 (1-p)^8 = 45 p^2 (1-p)^8
   8 failures before the 2nd success:  C(x+\alpha-1, x) p^\alpha q^x = 9 p^2 (1-p)^8
   Either way, p ~ Un(0,1)  ==>  p|x ~ Be(3,9);  E[p|x] = 1/4,  P[p < 1/2 | x] = 0.9673
   [numeric check in Sketch 1 below]
 - LH: the Likelihood Function is *any* positive multiple of f(x|\theta) at the observed x = X
 - Sequential Observations: the "prior" at stage n+1 is the "posterior" from stage n  [Sketch 2]
=========================================================================
7.3 Conjugate Prior Distributions:
 - Bernoulli sampling: Beta prior
 - Poisson sampling:   Gamma prior
 - Normal sampling:    Normal prior
   [update rules in Sketch 3 below]
 Improper Distributions (priors with infinite total mass; the posterior may still be proper)

 Exponential Families:
   General: f(x|\theta) = h(x) exp{ \eta(\theta) . T(x) - B(\theta) }
   Natural: f(x|\eta)   = h(x) exp{ \eta . T(x) - A(\eta) }
   Mean: E[T(X)] = J^{-1} \nabla B     (J_{ij} = \partial \eta_j / \partial \theta_i)
                 = \nabla A            in "natural" coordinates;
   Cov:  V[T(X)] = \nabla^2 A          (Hessian of 2nd partials, again in natural coords)
   [Sketch 4 below]
   MLE is the sol'n to:  T(X) = J^{-1} \nabla B   (= \nabla A, in natural coords)
   Simple random samples: T -> \sum T and B -> n B, so the MLE satisfies
       J^{-1}(\hat\theta) \nabla B(\hat\theta) = (1/n) \sum_j T(x_j),
   while the mean of T at the true parameter value \theta* satisfies
       J^{-1}(\theta*) \nabla B(\theta*) = E[T(X)];
   so, by the law of large numbers, \hat\theta -> \theta* as n -> \infty, i.e., CONSISTENCY.
   In fact, for large n, \hat\theta is approx No( \theta*, I(\theta*)^{-1}/n ), where I is
   the Fisher information (in natural coords, I(\eta) = \nabla^2 A(\eta) = V[T(X)]), so the
   MLE is asymptotically normally distributed.  [Sketch 5 below]

7.4 Bayes Estimators (Decision Theory):
   Loss: L(\theta, a)
   DecF: \delta(x) \in \cA
   Risk: R(\theta, \delta) = E[ L(\theta, \delta(X)) | \theta ] = \int L(\theta, \delta(x)) f(x|\theta) dx
   ExpL: E[ L(\theta, a) | x ] = \int_\Theta L(\theta, a) \pi(\theta|x) d\theta
   BayR: R(\pi, \delta) = E[ L(\theta, \delta(X)) ]   (expectation over BOTH \theta and X)
                        = \iint L(\theta, \delta(x)) f(x|\theta) dx \pi(\theta) d\theta
   Bayes Estimator: \delta* chosen to minimize BayR; equivalently, \delta*(x) minimizes the
   posterior expected loss E[ L(\theta, a) | x ] for each x.
   E.g.: L(\theta, a) = |\theta - a|^2  ==>  \delta*(x) = posterior MEAN   (Normal, Poisson examples)
   E.g.: L(\theta, a) = |\theta - a|    ==>  \delta*(x) = posterior MEDIAN (Exponential example)
   [Sketch 6 below]
 - Large Samples: DeGroot example: Bernoulli, n=100, y=10; Be(1,2) and Un(0,1) priors give
   nearly the same posterior.  [Sketch 7 below]
 - Consistency: Asymptotically, \delta*(X) ~ No(\theta, c/n) for some constant c, so the
   Bayes estimator is consistent as well.
=========================================================================
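
Worked sketches (Python; quick illustrative checks, not from D&S, assuming numpy and
scipy are available):

Sketch 1 (the Sec 7.2 example). Both the binomial count 45 and the negative-binomial
count 9 drop out of the likelihood, so with a Un(0,1) = Be(1,1) prior either model
gives the same Be(3,9) posterior:

    from scipy.stats import beta

    post = beta(3, 9)        # Be(1+2, 1+8): Be(1,1) prior, 2 succ, 8 fail
    print(post.mean())       # 0.25   = E[p | x]
    print(post.cdf(0.5))     # 0.9673 = P[p < 1/2 | x]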
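
Sketch 2 (sequential observations). With hypothetical 0/1 data, one-at-a-time Beta
updates match a single batch update:

    xs = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # hypothetical data: 2 succ, 8 fail

    a, b = 1.0, 1.0                       # Un(0,1) = Be(1,1) prior
    for x in xs:                          # stage-n posterior = stage-(n+1) prior
        a, b = a + x, b + (1 - x)
    print(a, b)                           # (3.0, 9.0)

    # One batch update: Be(1 + sum(xs), 1 + len(xs) - sum(xs)) = Be(3,9). Same.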
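
Sketch 3 (conjugate updates). The standard update rules for the three conjugate pairs
of Sec 7.3, written as small functions (the names and signatures are mine):

    def beta_update(a, b, xs):
        # Bernoulli sampling, Be(a,b) prior -> Be(a + #succ, b + #fail)
        return a + sum(xs), b + len(xs) - sum(xs)

    def gamma_update(a, b, xs):
        # Poisson sampling, Ga(a,b) prior (b = rate) -> Ga(a + sum x, b + n)
        return a + sum(xs), b + len(xs)

    def normal_update(m, v, xs, s2):
        # Normal sampling w/ known variance s2, No(m,v) prior on the mean:
        # precisions add, 1/v' = 1/v + n/s2, and the posterior mean is a
        # precision-weighted average of the prior mean and the data.
        n = len(xs)
        v_post = 1.0 / (1.0 / v + n / s2)
        m_post = v_post * (m / v + sum(xs) / s2)
        return m_post, v_post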
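
Sketch 4 (exponential-family identities). The Poisson family in natural form has
\eta = log(\lambda), T(x) = x, A(\eta) = e^\eta, h(x) = 1/x!, so E[T(X)] = \nabla A
says E[X] = \lambda. A finite-difference check of A' (the rate 3.7 is arbitrary):

    import math

    lam = 3.7
    eta = math.log(lam)
    A = lambda e: math.exp(e)                 # log-partition A(eta) = e^eta

    h = 1e-5
    dA = (A(eta + h) - A(eta - h)) / (2 * h)  # centered difference ~ A'(eta)
    print(dA, lam)                            # both ~ 3.7, i.e. E[X] = lambda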
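
Sketch 5 (MLE consistency and asymptotic normality). For a Poisson sample the MLE
equation \nabla A(\hat\eta) = (1/n) \sum T(x_j) reduces to e^{\hat\eta} = xbar, i.e.
\hat\eta = log(xbar). A Monte Carlo check (lambda, n, and the replication count are
arbitrary) that sd(\hat\eta) ~ (n \nabla^2 A)^{-1/2} = 1/sqrt(n*lambda):

    import numpy as np

    rng = np.random.default_rng(0)
    lam, n, reps = 3.7, 400, 2000
    eta = np.log(lam)

    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    eta_hat = np.log(xbar)                      # solves exp(eta_hat) = xbar

    print(eta_hat.mean(), eta)                  # ~ equal: consistency
    print(eta_hat.std(), 1 / np.sqrt(n * lam))  # ~ equal: asymptotic variance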
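
Sketch 6 (Bayes estimators under different losses). Minimizing the posterior expected
loss over a grid of actions, for the Be(3,9) posterior of Sketch 1, recovers the
posterior mean under squared-error loss and the posterior median under absolute loss:

    import numpy as np
    from scipy.stats import beta

    post = beta(3, 9)
    th = np.linspace(0.0, 1.0, 2001)
    w = post.pdf(th)
    w /= w.sum()                               # discretized posterior weights

    a = np.linspace(0.0, 1.0, 1001)            # candidate actions
    sq = ((th[None, :] - a[:, None]) ** 2 * w).sum(axis=1)
    ab = (np.abs(th[None, :] - a[:, None]) * w).sum(axis=1)

    print(a[sq.argmin()], post.mean())         # ~ 0.25: the posterior mean
    print(a[ab.argmin()], post.ppf(0.5))       # ~ the posterior median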
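
Sketch 7 (large samples). The DeGroot illustration quoted above: Bernoulli with
n=100, y=10 under a Be(1,2) prior vs a Un(0,1) = Be(1,1) prior. The posteriors
Be(11,92) and Be(11,91) barely differ, and both concentrate near y/n = 0.1, so with
this much data the choice of prior hardly matters:

    from scipy.stats import beta

    n, y = 100, 10
    for a0, b0 in [(1, 2), (1, 1)]:            # Be(1,2) prior, then Un(0,1)
        post = beta(a0 + y, b0 + n - y)
        print(post.mean(), post.std())
    # means 11/103 ~ 0.107 and 11/102 ~ 0.108; sd ~ 0.03 in both cases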