Rebecca C. Steorts

August 30, 2016

- Handmatching
- Exact Matching
- Fellegi Sunter Methods
- Limitations

- Evaluations Metrics

Handmatching is one of the most widely used methods in practice. Why?

- A domain knowledge expert compares two records and decides if they are a match or non-match.

Benefits: This is easy and simple to implement. Downsides:

- This is very time consuming to implement in practice.
- It is impossible to assess the accuracy of any handmatched method, especially when the quality of the data is noisy.
- Handmatching is expensive in practice (you must pay experts to match data)!

- If two records agree on all features, then they are a match.
- Otherwise, they are not a match.

- If two records agree on all features
**exept**one, then they are a match. - Otherwise, they are not a match.
There are obvious extensions of this.

Exact matching is very easy to implement and is widely used. Like all record linkage methods, it's not scalable.

- Assume two databases \( A \) and \( B \).
- Let's compare a record two records \( a \in A \) and \( b \in B. \)
- Do records \( a,b \) agree on “gender, DOB year, DOB month”?
- We decide if \( a,b \) are a match/non-match by looking at a likliehood ratio test (LRT).
- If the LRT is above a threshold, then we assume a match, and otherwise we assume a non-match.

- \[ \Gamma_{ab} = (\Gamma_{ab}(1),\ldots, \Gamma_{ab}(q)) \] is a binary vector of comparisons between records \( a,b \) taking values \( \Gamma = \{0,1\}^q \)
- Let \( M \) be a match, \( U \) be a non-match
- \( p_M \) : prob of a match

\[ \begin{aligned} P[(a,b) \in M] &= p_M \\ P[\Gamma_{ab} = g \mid (a,b \in M)] &= \pi_{g \mid M}\\ P[\Gamma_{ab} = g \mid (a,b \in U)] &= \pi_{g \mid U}\\ \end{aligned} \] Putting this together we arrive at a two component mixture model: \[ P[\Gamma_{ab} = p_M \pi_{g \mid M} + (1-p_M)\pi_{g \mid U} \]

- The first component corresponds to the
**true matches**and - the second component correspond to the
**true non-matches**.

- It is possible (see the original paper) to derive through optimizatin techniques estimates for the parameters.
- However, more commonly the EM algorithm is used.
- The EM algorithm is used when you have missing data or latent (unknown variables)

Four classifications of how pairs of records can be linked or not linked under the truth and under the estimate.

- Record pairs can be linked under both the truth and under the estimate, which we refer to as
**correct links**(CL). - Record pairs can
be linked under the truth but not linked under the estimate, which are called
**false negatives**(FN). - Record pairs can be not linked under the truth but
linked under the estimate, which are called
**false positives**(FP). - Record pairs can be not linked under the truth and also not linked under
the estimate, which we refer to as
**correct non-links**(CNL).

\[ FNR = \frac{FN}{CL+FN} \] \[ FPR = \frac{FP}{FP + CNL}. \]

- Note: FNR as defined is note appropriate metric for record linkage. Why?
- Instead, look at FDR.

\[ FDR = \frac{FP}{CL + FP}, \] where by convention take \( FDR = 0 \) if the num and denom are both 0.