Statistics, based (loosely) on DeGroot & Schervish

----------------------------------------
Housekeeping Preliminaries:

 1) This is a hard course.  It is curved to a B+/A-.  Drop-add ends Wed Jan 21.
 2) Homeworks are due in class on Fridays, and will (usually) be returned
    at the next Monday's lab.  The lowest HW score is dropped.
 3) Sakai is used only for HW & exam scores.  The class website has the syllabus.
 4) A bit about me & course expectations.
 5) Please tell me if you feel your questions or comments go unheard.

----------------------------------------
Week 1: Intro and Review of Probability

Probability: Fix some distribution (SAT: Normal, mean mu=500, std dev sig=100;
RCT: Binomial, n=10, p=0.20; etc.) with KNOWN pdf or pf f(x) (we know the
PARAMETERS too); then predict features of X ~ f(x), like:

    E[ X ]       = \int x f(x) dx              { or \sum x f(x) }
    P[ X <= 17 ] = \int_{-oo}^{17} f(t) dt     { or a sum ... }

Statistics: Observe a random variable X=x and try to guess what f(.) was.
Often "f(.)" will be from some familiar parametric family (as above),
{ f(x | theta) : theta in Theta }, so all that is "unknown" is/are the
parameter(s).  We'll return later to the question of picking such a family,
or "statistical model".  Generically, call the parameter "theta" and try to
guess what theta is.

Usually we observe not just one X but a "random sample", i.e., a VECTOR of
INDEPENDENT observations, all from the same distribution, X = (X_1,...,X_n).
We still call it "X", only now by "Normal" or "Binomial" we mean a VECTOR of
(usually independent) observations.

Why?  Goals:
 - Predict:  X_{n+1} = ???
 - Test:     See if an HS student's math and verbal skills are strong enough
             for him/her to succeed at Duke;
 - Estimate: What fraction p of the population is helped by a new medical
             treatment?

Modeling: The art of CHOOSING a family of probability distributions to
describe whatever is uncertain in a problem.  It involves making ASSUMPTIONS
(and sometimes checking, or at least discussing, them), e.g., independence.
This is as much a SUBJECTIVE art as it is a science.  We must use judgment
and experience to recognize the most important features to reflect.  For
example: in a medical trial, is it okay to ignore gender?  age?  weight?
In an engineering trial, is temperature important?  In environmental studies,
is season?  tide stage?  meteorological features?  These are also
"statistical questions" you'll want to get good at.

---

Students in this course will need to be (or become) very familiar with a
dozen or so probability distributions --- know their pdf's by heart, their
means & variances, the relations among them, etc.  Let's review a few, in
related groups:

Fishing Example (Po, Ex, Ga):

  N_1 = number of fish caught in one hour ~ Po( mu ),
        P[ N_1 = k ] = mu^k e^{-mu} / k!,                        k=0,1,2,...

  N_t = number of fish caught in t hours:  mu = lambda * t, so
        N_t ~ Po( lambda * t ),
        P[ N_t = k ] = (lambda t)^k e^{-lambda t} / k!,          k=0,1,2,...

  T_1 = length of time until the first fish is caught:
        P[ T_1 > t ] = P[ N_t = 0 ] = e^{-lambda t},             t>0
        ==> F(t) = 1 - e^{-lambda t} for t>0, and
            f(t) = F'(t) = lambda e^{-lambda t},                 t>0

  T_k = length of time until k fish are caught:
        P[ T_k <= t ] = P[ N_t >= k ]
                      = sum_{j=k}^oo (lambda t)^j e^{-lambda t} / j!
        ==> T_k ~ Ga( alpha=k, beta=lambda )
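  (A quick numerical sanity check of the duality P[ T_k <= t ] = P[ N_t >= k ],
  as a minimal Python sketch --- not part of the course material.  The rate
  lam, count k, and time t below are illustrative choices, and numpy/scipy
  are assumed to be available.)

      import numpy as np
      from scipy import stats

      lam, k, t = 2.0, 3, 1.5        # fish per hour, fish wanted, hours

      # Left side: T_k ~ Ga(alpha=k, beta=lam); scipy uses scale = 1/beta.
      lhs = stats.gamma.cdf(t, a=k, scale=1/lam)

      # Right side: N_t ~ Po(lam*t); P[ N_t >= k ] = 1 - P[ N_t <= k-1 ].
      rhs = stats.poisson.sf(k - 1, mu=lam * t)

      # Monte Carlo: time to the k'th fish = sum of k Exponential(lam)
      # interarrival times.
      rng = np.random.default_rng(0)
      T_k = rng.exponential(scale=1/lam, size=(100_000, k)).sum(axis=1)
      mc = (T_k <= t).mean()

      print(lhs, rhs, mc)            # all three should agree to ~2 decimals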
  Po:  E[ N_1 ] = mu          V[ N_1 ] = mu        N_1 takes values 0,1,2,...

  Ga:  E[ T^p ] = Ga(alpha+p) / { Ga(alpha) beta^p }
                = alpha/beta for p=1;   alpha(alpha+1)/beta^2 for p=2;
       ==> E[ T ] = alpha / beta        V[ T ] = alpha / beta^2
       c * T ~ Ga( alpha, beta/c )      (e.g.: c = 60, hours -> minutes)

-----------------------------------------------------------------------
Sidelight on the Gamma Function:

  Ga(z) = int_0^oo t^(z-1) e^(-t) dt

  Ga(1) = 1;   z Ga(z) = Ga(z+1)  ->  Ga(z) = (z-1)! for integers z >= 1;
  Ga(0) = oo.

  Ga(1/2) = int_0^oo t^(-1/2) e^(-t) dt          Set t = x^2;  dt = 2x dx
          = int_0^oo e^(-x^2) 2 dx
          = int_{-oo}^{oo} e^(-x^2) dx  =  sqrt(pi)
            (square the integral, then switch to polar coordinates).

  -> Ga(5/2) = 3/2 * Ga(3/2) = 3/2 * 1/2 * Ga(1/2) = 3 sqrt(pi) / 4

  Stirling's Formula:
     Ga(z) = sqrt(2 pi) z^(z-1/2) e^(-z) e^{ theta_z / (12 z) }
     for some 0 < theta_z < 1 ...  or, roughly,
     Ga(z) ~ sqrt(2 pi) z^(z-1/2) e^(-z).

  Sometimes we'll want to take derivatives of Gamma functions --- they are
  not available in closed form, but they have their own names:

     "Digamma function"    psi(z)  = (d/dz)     log{ Gamma(z) } = Gamma'(z)/Gamma(z)
     "Trigamma function"   psi'(z) = (d^2/dz^2) log{ Gamma(z) }

  Note: Gamma(z+1) = Gamma(z) * z
     ==> log{ Gamma(z+1) } = log{ Gamma(z) } + log(z)
     ==> psi (z+1) = psi (z) + 1/z
     ==> psi'(z+1) = psi'(z) - 1/z^2

  For integers,
     psi (k) = -gamma_e + 1 + 1/2 + 1/3 + ... + 1/(k-1)
     psi'(k) = (pi^2/6) - 1 - 1/4 - 1/9 - ... - 1/(k-1)^2

  so psi(k) is about log(k) -> oo and psi'(k) is about 1/k -> 0.  These would
  come up if we tried to estimate the "alpha" parameter of the Gamma
  distribution, or either parameter of the Beta distribution.

=======================================================================
Bernoulli Trials (Un, Bi, Ge, NB, Be):

  U_j   ~ Un(0,1)                                              Uniform
  Z_j   = 1_{ U_j <= p }                                       Bernoulli
  X_n   = # of successes in n trials                           Binomial
        = sum_{ j <= n } Z_j  ~  Bi(n, p)
  Y_1   = # of failures before the 1st success                 Geometric
        = max{ j: X_j = 0 }  ~  Ge(p)
  Y_k   = # of failures before the k'th success                Negative Binomial
        = max{ j: X_j < k } - (k-1)  ~  NB(k, p)
  U_[k] = k'th smallest of n iid uniform RV's                  Beta
        ~ Be(k, n-k+1)

  P[ X_n >= k ] = P[ at least k U_j's are <= p ] = P[ U_[k] <= p ] = F_k(p),
  for the k'th smallest U_[k] --- so
     f_k(p) = F_k'(p) = (d/dp) Sum{ binomial probs }
     ==> U_[k] ~ Be( alpha = k, beta = n+1-k )

  E[ Bi(n, p) ] = n p           V[ Bi(n, p) ] = n p (1-p)
  E[ NB(k, p) ] = k q/p         V[ NB(k, p) ] = k q/p^2        (q = 1-p)
  E[ U_j ]      = 1/2           V[ U_j ]      = 1/12
  E[ Be(a, b) ] = a/(a+b)       V[ Be(a, b) ] = ab / { (a+b)^2 (a+b+1) }
                                               = mu (1-mu) / (a+b+1)
  E[ U_[k] ]    = k/(n+1),   1 <= k <= n

-----------------------------------------------------------------------
  X ~ Ga(alpha_1, beta),  Y ~ Ga(alpha_2, beta),  independent
     =>  X+Y      ~ Ga(alpha_1 + alpha_2, beta)
         X/(X+Y)  ~ Be(alpha_1, alpha_2)

  For example: X, Y independent exponentials with the same rate
     =>  X/(X+Y) has the standard Uniform dist'n, independent of X+Y.

-----------------------------------------------------------------------
Normal & Friends (No, Ga, Chi-Sq):

  X ~ No(mu, sig^2)  ->  Z = (X - mu)/sig ~ No(0, 1)

  Z^2 has CDF:
     P[ Z^2 < t ] = P[ |Z| < sqrt(t) ] = Phi( sqrt(t) ) - Phi( -sqrt(t) )
                  = 2 Phi( sqrt(t) ) - 1,
  so
     f(t) = sqrt(2/pi) * 1/2 * t^(1/2 - 1) * e^{-t/2},   t>0
     ==> Z^2 ~ Ga( alpha = 1/2, beta = 1/2 )

  =>  Z = (Z_1,...,Z_n)  =>  |Z|^2 ~ Ga( n/2, 1/2 )
      = "chi squared with n degrees of freedom"

     E[ |Z|^2 ] = n;    V[ |Z|^2 ] = alpha/beta^2 = 2n
     =>  E[ |Z|^4 ] = n^2 + 2n  =  3, for n=1.
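  (A small simulation sketch of the last identity: for Z = (Z_1,...,Z_n) iid
  No(0,1), |Z|^2 ~ Ga(n/2, 1/2), with mean n and variance 2n.  Python with
  numpy/scipy is assumed; the n and replication count are arbitrary choices.)

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n, reps = 5, 200_000
      Z = rng.standard_normal(size=(reps, n))
      S = (Z**2).sum(axis=1)              # |Z|^2 for each replicate

      print(S.mean(), n)                  # sample mean should be close to n
      print(S.var(), 2 * n)               # sample variance close to 2n

      # Empirical CDF vs Ga(alpha=n/2, beta=1/2), i.e. scipy's gamma with
      # a = n/2 and scale = 1/beta = 2 (same as chi-squared with n df).
      for t in (2.0, 5.0, 10.0):
          print((S <= t).mean(), stats.gamma.cdf(t, a=n/2, scale=2.0))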
-----------------------------------------------------------------------
Change of Variables:

Y = g(X),  X ~ f(x),  one dimension:

  (a) dy = g'(x) dx  ->  dx = dy / g'(x) = dy / g'( g-inv(y) )
      ==>  Y ~ f( g-inv(y) ) / | g'( g-inv(y) ) |

  (b) Monotone increasing:
      H( g(x) ) = P[ g(X) <= g(x) ] = P[ X <= x ] = F(x)
      h( g(x) ) * g'(x) = f(x)
      h( y ) = f( g-inv(y) ) / g'( g-inv(y) )

      Extend to monotone decreasing, then piecewise.

--------
Y = g(X),  X ~ f(x),  n dimensions:

  (a) dy_i = \sum_j (dg_i/dx_j) dx_j,
      dy_1 dy_2 = [ dg_1/dx_1 dx_1 + dg_1/dx_2 dx_2 ] *
                  [ dg_2/dx_1 dx_1 + dg_2/dx_2 dx_2 ],
      with "dx_i dx_i" = 0 and dx_j dx_i = -dx_i dx_j.

  (b) E[ phi(Y) ] = \iint phi( y )    h( y )                 dy_1...dy_n
                  = \iint phi( g(x) ) f( x )                 dx_1...dx_n
                  = \iint phi( y )    f( g-inv(y) ) |J(y)|   dy_1...dy_n

      where J_ij = dx_i/dy_j and
      |A| = abs. det.(A) = 1 / abs( det( [ dy_i/dx_j ] ) ).

---------
Example:

  X_1 ~ Be(a, b),  X_2 ~ Ga(a+b, 1),  independent;
  g(x1, x2) = [ x1*x2, (1-x1)*x2 ]
  [ x1, x2 ] = g-inv(y1, y2) = [ y1/(y1+y2), y1+y2 ]

  (a) Differential Method:
      dy1 = [  x2 dx1 +    x1  dx2 ]
      dy2 = [ -x2 dx1 + (1-x1) dx2 ]
      dy1 dy2 = x2 (1-x1) dx1 dx2 - x2 x1 dx2 dx1
              = (1-x1) x2 dx1 dx2 + x1 x2 dx1 dx2  =  x2 dx1 dx2
      dx1 dx2 = 1/(y1+y2) dy1 dy2

      SO
      h(y1,y2) dy1 dy2
        = [ Ga(a+b) / { Ga(a) Ga(b) } ] (y1/(y1+y2))^(a-1) (y2/(y1+y2))^(b-1)
          * [ (y1+y2)^(a+b-1) / Ga(a+b) ] e^{-(y1+y2)} * 1/(y1+y2)  dy1 dy2

  (b) Jacobian Method:
      J = [ dx_i/dy_j ] = [ y2/(y1+y2)^2    -y1/(y1+y2)^2 ]
                          [       1                1      ]
      |J| = 1/|y1+y2|

      h(y1,y2)
        = [ Ga(a+b) / { Ga(a) Ga(b) } ] (y1/(y1+y2))^(a-1) (y2/(y1+y2))^(b-1)
          * [ (y1+y2)^(a+b-1) / Ga(a+b) ] e^{-(y1+y2)} / (y1+y2)
        = [ 1/Ga(a) ] y1^(a-1) e^{-y1}  *  [ 1/Ga(b) ] y2^(b-1) e^{-y2}

  SO, Y1, Y2 are independent Ga(a,1), Ga(b,1) RV's, while X1, X2 are
  independent Be(a,b), Ga(a+b,1) RV's.

----------
Really Important ("Canonical") Example: Multivariate Normal Distribution

  Let  mu  be a p-dim real vector (think of it as a px1 matrix),
       A   be a [lower-triangular?] [invertible?] pxp matrix,
       Z   be a p-dimensional iid No(0,1) vector (px1 matrix).

  Note: Order matters!  And both orders are important:
     Z'Z = \sum Z_j^2 is 1x1, the squared length (sum of squares);
     Z Z' = { Z_i Z_j } is pxp, with E[ Z Z' ] = I_p (identity matrix),
     zero off-diagonal and one on-diagonal (why?)
     = the COVARIANCE MATRIX for Z.

  Set X = mu + A Z ~ No( mu, A A' )
  (set C = A A', the symmetric pxp COVARIANCE matrix).

  What is the (joint) pdf for Z?
     prod_j { 1/sqrt(2 pi) exp(-z_j^2/2) } = [2 pi]^{-p/2} exp(-z'z/2)

  What is the (joint) pdf for X?  Use the change-of-variables formula.
  Note Z = A^(-1) (X - mu), so
     Z'Z = (X-mu)' { A^(-1)' A^(-1) } (X-mu) = (X-mu)' C^(-1) (X-mu)
     |J| = | det A | = sqrt{ det C }

     f(x | mu, C) = sum_{ z: x=g(z) } f_Z(z) / |J|           (J = dx/dz)
                  = [2 pi]^{-p/2} exp( -(x-mu)' C^(-1) (x-mu)/2 ) / |J|
                  = 1/sqrt{ det(2 pi C) } exp( -(x-mu)' C^(-1) (x-mu)/2 )

  We will seldom need to know coordinate-by-coordinate values for this, but
  it IS useful to know the abstract result: the joint pdf for a non-singular
  multivariate Normal random vector is

               1
     --------------------  exp( - (x-mu)' C^(-1) (x-mu) / 2 )
     sqrt( det(2 pi C) )

  Note the standard normal is the special case mu = 0, C = Id.

  ---
  If someone gave us this function f(x), how could we guess what mu was?
  What C was?  Is A uniquely determined?  How can we find one possible A?

  ---
  Note: the Cholesky decomposition gives a lower-triangular solution; the
  SVD gives an eigen solution, C = U D U' (so A = U D^{1/2} also works).
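  (A minimal simulation sketch of the construction X = mu + A Z with A the
  Cholesky factor of C, checking that the sample mean and sample covariance
  approach mu and C = A A'.  The particular mu and C below are made-up
  illustrative values; Python with numpy is assumed.)

      import numpy as np

      rng = np.random.default_rng(0)
      mu = np.array([1.0, -2.0, 0.5])
      C  = np.array([[ 2.0,  0.6, -0.3],
                     [ 0.6,  1.0,  0.2],
                     [-0.3,  0.2,  0.5]])   # symmetric, positive definite

      A = np.linalg.cholesky(C)             # lower triangular, A A' = C
      Z = rng.standard_normal(size=(100_000, 3))
      X = mu + Z @ A.T                      # each row is mu + A z

      print(X.mean(axis=0))                 # approx mu
      print(np.cov(X, rowvar=False))        # approx C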