The EM algorithm (2)

class: center, middle, inverse, title-slide

# The EM algorithm (2)
### Yue Jiang
### Duke University

---

### The EM algorithm

The .vocab[EM algorithm] (expectation-maximization) is an iterative procedure
that often leverages a .vocab[complete-data likelihood] to perform estimation
about parameters in the observed data.

1. Initialize parameter values
2. .vocab[E-step]: construct expected log-likelihood function, where expectation
takes advantage of latent variable formulation and is taken using parameter
estimates from current M-step
3. .vocab[M-step]: calculate maximum likelihood estimates from expected 
log-likelihood function constructed from current E-step to inform distribution
of latent variables
4. Repeat until convergence criterion is satisfied

We "guess" the latent variables and use them to maximize an "easier" likelihood.
The EM algorithm hinges on defining a useful complete data log-likelihood.

---

### A few caveats

The M-step doesn't maximize the *observed* log-likelihood, but rather a 
surrogate function given by the conditional expectation of the complete data
log-likelihood.

The good news is that at each iteration the value of the observed log-likelihood
will never decrease. However, if initial values are chosen poorly, the EM 
algorithm may get stuck in a local maximum or stationary value.

In the previous example we found a closed form solution to the M-step (these 
are simply MLE estimates from normal distributions). However, sometimes no 
closed form exists and we must rely on numerical methods to obtain solutions 
(e.g., EM algorithm with one-step Newton-Raphson in the M step).

---

### The EM algorithm

Note that the E and M steps "cycle," and so we don't have to start with one
or the other.

In practice, the convergence criterion is often based on changes in the 
log-likelihood function evaluated at parameter estimates at each iteration.

---

### The ABO blood group system

---

### The ABO blood group system

.pull-left[
Human blood group refers to a classification of blood based on presence or
absence of certain surface antigens on red blood cells.

One of the most important blood group systems is the ABO system; during blood 
transfusion, if mismatches occur in the ABO blood group system, potentially 
fatal reactions may occur. 
]

.pull-right[
<img src="img/karl.jpg" width="80%" style="display: block; margin: auto;" />
]

---

### The ABO blood group system

.pull-left[
<img src="img/abo.jpg" width="100%" style="display: block; margin: auto;" />
.
]

.pull-right[
- Three alleles: A, B, and O
- One allele inherited from each parent
- A and B alleles are co-dominant; O is recessive
- Six unique genotypes given in table to the left
- Four phenotypes possible given these genotypes: A-type blood, B-type blood, 
AB-type blood, and O-type blood
]

---

### The ABO blood group system

.pull-left[
<img src="img/abo.jpg" width="100%" style="display: block; margin: auto;" />
.
]

.pull-right[
Let `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` be .vocab[allele frequencies] of underlying alleles. 
Then .vocab[genotype frequencies] are:
- AA: `\(p_A^2\)`
- BB: `\(p_B^2\)`
- AB: `\(2p_Ap_B\)`
- AO: `\(2p_Ap_O\)`
- BO: `\(2p_Bp_O\)`
- OO: `\(p^2_O\)`
]
 
---

### Allele frequency estimation

Assume genotype counts are multinomially distributed:

`\begin{align*}
P(N_{AA} = n_{AA}, \cdots, N_{OO} = n_{OO}) = \frac{n!}{n_{AA}\cdots!n_{OO}!}p_{AA}^{n_{AA}}\cdots p_{OO}^{n_{OO}},
\end{align*}`
for `\(\mathcal{B} = \{AA, BB, AB, AO, BO, OO \}\)`, and where `\(\sum\limits_{i \in \mathcal{B}} n_{i} = n\)`.

We observe the phenotype counts `\(N_A\)`, `\(N_B\)`, `\(N_{AB}\)` and `\(N_O\)` only.

.question[
Using only *observed phenotype counts*, can we estimate **allele frequencies**
in the underlying population? What is the observed data likelihood?
]

---

# Allele frequency estimation

Using the observed *phenotype* frequencies, we have:

`\begin{align*}
\mathcal{L}(p_A, p_B, p_O) &\propto \left(p_A^2 + 2p_Ap_O \right)^{n_A}\left(p_B^2 + 2p_Bp_O \right)^{n_B}\times\\
&\mathrel{\phantom{=}}\left(2p_Ap_B\right)^{n_{AB}}\left(p_O^{2}\right)^{n_O}\\
\log\mathcal{L}(p_A, p_B, p_O) &= n_A\log\left(p_A^2 + 2p_Ap_O \right) + \\
&\mathrel{\phantom{=}} n_A\log\left(p_A^2 + 2p_Ap_O \right) + \\
&\mathrel{\phantom{=}} n_{AB}\log(2p_Ap_B) + 2n_O \log(p_O) + \\
&\mathrel{\phantom{=}} const.
\end{align*}`

Setting partial derivatives equal to zero and solving is a disaster.

.question[
How would you use a latent variable to write the complete data log-likelihood 
using the *genotype* frequencies?
]

---

### The ABO blood group system

The complete data log-likelihood is

`\begin{align*}
\log\mathcal{L}(p_A, p_B, p_O) &= n_{AA}\log(p_A^2) + n_{BB}\log(p_B^2) + \\
&\mathrel{\phantom{=}} n_{AO}\log(2p_Ap_O) + n_{BO}\log(2p_Bp_O) + \\
&\mathrel{\phantom{=}} n_{AB}\log(2p_Ap_B) + n_{OO}\log(p_O^2) + const.,
\end{align*}`

which we notice is much easier to maximize (no more sums of different parameters
inside a log).

.question[
How could you maximize this log-likelihood function with respect to the 
parameters of interest (watch out for a constraint)?
]

---

### The ABO blood group system

Let's calculate MLEs for `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` using the complete data
log-likelihood. We can use Lagrange multipliers to solve the optimization
problem under the constraint `\(p_A + p_B + p_O = 1\)` (again, no work shown; please
verify!):

`\begin{align*}
\hat{p}_A &= \frac{2n_{AA} + n_{AO} + n_{AB}}{2n}\\
\hat{p}_B &= \frac{2n_{BB} + n_{BO} + n_{AB}}{2n}\\
\hat{p}_O &= \frac{2n_{OO} + n_{AO} + n_{BO}}{2n}.
\end{align*}`

.question[
Now how about the E-step?
]

---

### The ABO blood group system

Once again, we never observe the latent variables (in this case the genotype
frequencies). However we can again get useful information by maximizing 
conditional expectations of the log-likelihood for the latent data, given the 
observed data and model parameters.

For observed blood types `\(AB\)` and `\(O\)` which can only happen given a single 
genotype combination we have

`\begin{align*}
E(n_{AB} | \mathbf{X}, p_A, p_B, p_O) &= n_{AB}\\
E(n_{OO} | \mathbf{X}, p_A, p_B, p_O) &= n_O\\
\end{align*}`

.question[
How can we get information about genotype frequencies `\(n_{AA}\)` and `\(n_{AO}\)` 
given our observed phenotype frequencies?
]

---

### The ABO blood group system

Note that `\(n_{AA} + n_{AO} = n_A\)`. Thus, we know

`\begin{align*}
n_{AA} | n_A \sim Binomial\left(n_{A}, \frac{p^2_A}{p^2_A + 2p_Ap_O} \right)
\end{align*}`

and so we have

`\begin{align*}
E(n_{AA} | \mathbf{X}, p_A, p_B, p_O) &= \frac{n_Ap_A^2}{p^2_A + 2p_Ap_O}
\end{align*}`

---

### The ABO blood group system

Similarly,

`\begin{align*}
E(n_{BB} | \mathbf{X}, p_A, p_B, p_O) &= \frac{n_Bp_B^2}{p^2_B + 2p_Bp_O}\\
E(n_{AO} | \mathbf{X}, p_A, p_B, p_O) &= \frac{2n_Ap_Ap_O}{p^2_A + 2p_Ap_O}\\
E(n_{BO} | \mathbf{X}, p_A, p_B, p_O) &= \frac{2n_Bp_Ap_O}{p^2_B + 2p_Bp_O}
\end{align*}`

We can thus plug in these conditional expectations to the expressions for the 
MLE from the M-step, and use MLE estimates of `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` from the 
M-step to obtain estimates for these conditional expectations in the E-step, and 
so on.

---

### The ABO blood group system

The EM algorithm for estimating allele frequencies is thus given by

1. Initialize values `\(p_A\)`, `\(p_B\)`, and `\(p_O\)` (note that `\(p_A + p_B + p_O = 1\)`)
2. Estimate genotype counts using current parameter estimates from M-step
3. Update MLEs using parameter estimates and genotype counts from E-step
4. Repeat until convergence criterion is satisfied

.question[
Suppose in a group of 300 individuals we observe 135 with blood type A, 39 with 
blood type B, 108 with blood type O, and 18 with blood type AB. Try implementing 
the EM algorithm. What initial parameter values did you use? What were your 
final allele frequencies?
]