class: center, middle, inverse, title-slide # The EM algorithm (2) ### Yue Jiang ### Duke University --- ### The EM algorithm The .vocab[EM algorithm] (expectation-maximization) is an iterative procedure that often leverages a .vocab[complete-data likelihood] to perform estimation about parameters in the observed data. 1. Initialize parameter values 2. .vocab[E-step]: construct expected log-likelihood function, where expectation takes advantage of latent variable formulation and is taken using parameter estimates from current M-step 3. .vocab[M-step]: calculate maximum likelihood estimates from expected log-likelihood function constructed from current E-step to inform distribution of latent variables 4. Repeat until convergence criterion is satisfied We "guess" the latent variables and use them to maximize an "easier" likelihood. The EM algorithm hinges on defining a useful complete data log-likelihood. --- ### A few caveats The M-step doesn't maximize the *observed* log-likelihood, but rather a surrogate function given by the conditional expectation of the complete data log-likelihood. The good news is that at each iteration the value of the observed log-likelihood will never decrease. However, if initial values are chosen poorly, the EM algorithm may get stuck in a local maximum or stationary value. In the previous example we found a closed form solution to the M-step (these are simply MLE estimates from normal distributions). However, sometimes no closed form exists and we must rely on numerical methods to obtain solutions (e.g., EM algorithm with one-step Newton-Raphson in the M step). --- ### The EM algorithm Note that the E and M steps "cycle," and so we don't have to start with one or the other. In practice, the convergence criterion is often based on changes in the log-likelihood function evaluated at parameter estimates at each iteration. --- ### The ABO blood group system <img src="img/blood.jpg" width="80%" style="display: block; margin: auto;" /> --- ### The ABO blood group system .pull-left[ Human blood group refers to a classification of blood based on presence or absence of certain surface antigens on red blood cells. One of the most important blood group systems is the ABO system; during blood transfusion, if mismatches occur in the ABO blood group system, potentially fatal reactions may occur. ] .pull-right[ <img src="img/karl.jpg" width="80%" style="display: block; margin: auto;" /> ] --- ### The ABO blood group system .pull-left[ <img src="img/abo.jpg" width="100%" style="display: block; margin: auto;" /> . ] .pull-right[ - Three alleles: A, B, and O - One allele inherited from each parent - A and B alleles are co-dominant; O is recessive - Six unique genotypes given in table to the left - Four phenotypes possible given these genotypes: A-type blood, B-type blood, AB-type blood, and O-type blood ] --- ### The ABO blood group system .pull-left[ <img src="img/abo.jpg" width="100%" style="display: block; margin: auto;" /> . ] .pull-right[ Let `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` be .vocab[allele frequencies] of underlying alleles. Then .vocab[genotype frequencies] are: - AA: `\(p_A^2\)` - BB: `\(p_B^2\)` - AB: `\(2p_Ap_B\)` - AO: `\(2p_Ap_O\)` - BO: `\(2p_Bp_O\)` - OO: `\(p^2_O\)` ] --- ### Allele frequency estimation Assume genotype counts are multinomially distributed: `\begin{align*} P(N_{AA} = n_{AA}, \cdots, N_{OO} = n_{OO}) = \frac{n!}{n_{AA}\cdots!n_{OO}!}p_{AA}^{n_{AA}}\cdots p_{OO}^{n_{OO}}, \end{align*}` for `\(\mathcal{B} = \{AA, BB, AB, AO, BO, OO \}\)`, and where `\(\sum\limits_{i \in \mathcal{B}} n_{i} = n\)`. We observe the phenotype counts `\(N_A\)`, `\(N_B\)`, `\(N_{AB}\)` and `\(N_O\)` only. .question[ Using only *observed phenotype counts*, can we estimate **allele frequencies** in the underlying population? What is the observed data likelihood? ] --- # Allele frequency estimation Using the observed *phenotype* frequencies, we have: `\begin{align*} \mathcal{L}(p_A, p_B, p_O) &\propto \left(p_A^2 + 2p_Ap_O \right)^{n_A}\left(p_B^2 + 2p_Bp_O \right)^{n_B}\times\\ &\mathrel{\phantom{=}}\left(2p_Ap_B\right)^{n_{AB}}\left(p_O^{2}\right)^{n_O}\\ \log\mathcal{L}(p_A, p_B, p_O) &= n_A\log\left(p_A^2 + 2p_Ap_O \right) + \\ &\mathrel{\phantom{=}} n_A\log\left(p_A^2 + 2p_Ap_O \right) + \\ &\mathrel{\phantom{=}} n_{AB}\log(2p_Ap_B) + 2n_O \log(p_O) + \\ &\mathrel{\phantom{=}} const. \end{align*}` <!-- \mathcal{L}(p_A, p_B) &\propto \left(p_A^2 + 2p_A(1 - p_A - p_B) \right)^{n_A} \times \\ --> <!-- &\mathrel{\phantom{=}}\left(p_B^2 + 2p_B(1 - p_A - p_B) \right)^{n_B}\times\\ --> <!-- &\mathrel{\phantom{=}}\left(2p_Ap_B\right)^{n_{AB}}\left((1 - p_A - p_B)^2\right)^{n_O}\\ --> <!-- \log\mathcal{L}(p_A, p_B) &= n_A\log\left(p_A^2 + 2p_A(1 - p_A - p_B) \right) +\\ --> <!-- &\mathrel{\phantom{=}} n_B\log\left(p_B^2 + 2p_B(1 - p_A - p_B) \right) + \\ --> <!-- &\mathrel{\phantom{=}} n_{AB}\log(2p_Ap_B) + 2n_O \log(1 - p_A - p_B) + \\ --> <!-- &\mathrel{\phantom{=}} const. --> Setting partial derivatives equal to zero and solving is a disaster. .question[ How would you use a latent variable to write the complete data log-likelihood using the *genotype* frequencies? ] --- ### The ABO blood group system The complete data log-likelihood is `\begin{align*} \log\mathcal{L}(p_A, p_B, p_O) &= n_{AA}\log(p_A^2) + n_{BB}\log(p_B^2) + \\ &\mathrel{\phantom{=}} n_{AO}\log(2p_Ap_O) + n_{BO}\log(2p_Bp_O) + \\ &\mathrel{\phantom{=}} n_{AB}\log(2p_Ap_B) + n_{OO}\log(p_O^2) + const., \end{align*}` which we notice is much easier to maximize (no more sums of different parameters inside a log). .question[ How could you maximize this log-likelihood function with respect to the parameters of interest (watch out for a constraint)? ] --- ### The ABO blood group system Let's calculate MLEs for `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` using the complete data log-likelihood. We can use Lagrange multipliers to solve the optimization problem under the constraint `\(p_A + p_B + p_O = 1\)` (again, no work shown; please verify!): `\begin{align*} \hat{p}_A &= \frac{2n_{AA} + n_{AO} + n_{AB}}{2n}\\ \hat{p}_B &= \frac{2n_{BB} + n_{BO} + n_{AB}}{2n}\\ \hat{p}_O &= \frac{2n_{OO} + n_{AO} + n_{BO}}{2n}. \end{align*}` .question[ Now how about the E-step? ] --- ### The ABO blood group system Once again, we never observe the latent variables (in this case the genotype frequencies). However we can again get useful information by maximizing conditional expectations of the log-likelihood for the latent data, given the observed data and model parameters. For observed blood types `\(AB\)` and `\(O\)` which can only happen given a single genotype combination we have `\begin{align*} E(n_{AB} | \mathbf{X}, p_A, p_B, p_O) &= n_{AB}\\ E(n_{OO} | \mathbf{X}, p_A, p_B, p_O) &= n_O\\ \end{align*}` .question[ How can we get information about genotype frequencies `\(n_{AA}\)` and `\(n_{AO}\)` given our observed phenotype frequencies? ] --- ### The ABO blood group system Note that `\(n_{AA} + n_{AO} = n_A\)`. Thus, we know `\begin{align*} n_{AA} | n_A \sim Binomial\left(n_{A}, \frac{p^2_A}{p^2_A + 2p_Ap_O} \right) \end{align*}` and so we have `\begin{align*} E(n_{AA} | \mathbf{X}, p_A, p_B, p_O) &= \frac{n_Ap_A^2}{p^2_A + 2p_Ap_O} \end{align*}` --- ### The ABO blood group system Similarly, `\begin{align*} E(n_{BB} | \mathbf{X}, p_A, p_B, p_O) &= \frac{n_Bp_B^2}{p^2_B + 2p_Bp_O}\\ E(n_{AO} | \mathbf{X}, p_A, p_B, p_O) &= \frac{2n_Ap_Ap_O}{p^2_A + 2p_Ap_O}\\ E(n_{BO} | \mathbf{X}, p_A, p_B, p_O) &= \frac{2n_Bp_Ap_O}{p^2_B + 2p_Bp_O} \end{align*}` We can thus plug in these conditional expectations to the expressions for the MLE from the M-step, and use MLE estimates of `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` from the M-step to obtain estimates for these conditional expectations in the E-step, and so on. --- ### The ABO blood group system The EM algorithm for estimating allele frequencies is thus given by 1. Initialize values `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` (note that `\(p_A + p_B + p_O = 1\)`) 2. Estimate genotype counts using current parameter estimates from M-step 3. Update MLEs using parameter estimates and genotype counts from E-step 4. Repeat until convergence criterion is satisfied .question[ Suppose in a group of 300 individuals we observe 135 with blood type A, 39 with blood type B, 108 with blood type O, and 18 with blood type AB. Try implementing the EM algorithm. What initial parameter values did you use? What were your final allele frequencies? ]