Naive Record Linkage Methods

Rebecca C. Steorts
August 30, 2016

Naive Record Linkage Methods

Handmatching
Exact Matching
Fellegi Sunter Methods
- Limitations
Evaluations Metrics

Handmatching

Handmatching is one of the most widely used methods in practice. Why?

A domain knowledge expert compares two records and decides if they are a match or non-match.

Benefits: This is easy and simple to implement. Downsides:

This is very time consuming to implement in practice.
It is impossible to assess the accuracy of any handmatched method, especially when the quality of the data is noisy.
Handmatching is expensive in practice (you must pay experts to match data)!

Exact Matching

If two records agree on all features, then they are a match.
Otherwise, they are not a match.

Near-Twin Matching

If two records agree on all features exept one, then they are a match.
Otherwise, they are not a match.
There are obvious extensions of this.
Exact matching is very easy to implement and is widely used. Like all record linkage methods, it's not scalable.

Fellegi Sunter Method (Murray, 2015)

Assume two databases \( A \) and \( B \).
Let's compare a record two records \( a \in A \) and \( b \in B. \)
Do records \( a,b \) agree on “gender, DOB year, DOB month”?
We decide if \( a,b \) are a match/non-match by looking at a likliehood ratio test (LRT).
If the LRT is above a threshold, then we assume a match, and otherwise we assume a non-match.

Notation

\[ \Gamma_{ab} = (\Gamma_{ab}(1),\ldots, \Gamma_{ab}(q)) \] is a binary vector of comparisons between records \( a,b \) taking values \( \Gamma = \{0,1\}^q \)
Let \( M \) be a match, \( U \) be a non-match
\( p_M \) : prob of a match

Proposed FS Model

\[ \begin{aligned} P[(a,b) \in M] &= p_M \\ P[\Gamma_{ab} = g \mid (a,b \in M)] &= \pi_{g \mid M}\\ P[\Gamma_{ab} = g \mid (a,b \in U)] &= \pi_{g \mid U}\\ \end{aligned} \] Putting this together we arrive at a two component mixture model: \[ P[\Gamma_{ab} = p_M \pi_{g \mid M} + (1-p_M)\pi_{g \mid U} \]

The first component corresponds to the true matches and
the second component correspond to the true non-matches.

Estimation of Unknown Parameters

It is possible (see the original paper) to derive through optimizatin techniques estimates for the parameters.
However, more commonly the EM algorithm is used.
The EM algorithm is used when you have missing data or latent (unknown variables)

Evaluation Metrics

Four classifications of how pairs of records can be linked or not linked under the truth and under the estimate.

Record pairs can be linked under both the truth and under the estimate, which we refer to as correct links (CL).
Record pairs can be linked under the truth but not linked under the estimate, which are called false negatives (FN).
Record pairs can be not linked under the truth but linked under the estimate, which are called false positives (FP).
Record pairs can be not linked under the truth and also not linked under the estimate, which we refer to as correct non-links (CNL).

\[ FNR = \frac{FN}{CL+FN} \] \[ FPR = \frac{FP}{FP + CNL}. \]

Note: FNR as defined is note appropriate metric for record linkage. Why?
Instead, look at FDR.

\[ FDR = \frac{FP}{CL + FP}, \] where by convention take \( FDR = 0 \) if the num and denom are both 0.