Intro to Record Linkage Data Sets

Rebecca C. Steorts
August 30, 2016

Record Linkage Data Sets

Let's investigate what record linkage data sets look like in practice.

Data sets

  • Synthetic data (RLData500)
  • Italian Official statistics data (SHIW)
  • Health care data from NC
  • NC voter registration data
  • Data from the Syrian conflict (available for potential project)
  • If you have access to dataset that is shareable with the class involving record linkage, please let me know!

Synthetic data (RLData500)

  • This is one database of 500 records with 10 percent duplicate records.
  • This is a synthetic data set.
  • There are unique ids available.
  • Features: first name, last name, year, month, and day of birth
  • Data set is in the Record Linkage package in R.
  • There is a larger version of this data set called RLdata10000 (10,000 records instead of 5,000 records).

Synthetic data (RLData500)

## load packages
library(RecordLinkage)
data(RLdata500)
## read in and remove missing values
myDat <- RLdata500[,-c(2,4)]
dim(myDat)
[1] 500   5
head(myDat)
  fname_c1 lname_c1   by bm bd
1  CARSTEN    MEIER 1949  7 22
2     GERD    BAUER 1968  7 27
3   ROBERT HARTMANN 1930  4 30
4   STEFAN    WOLFF 1957  9  2
5     RALF  KRUEGER 1966  1 13
6  JUERGEN   FRANKE 1929  7  4

Synthetic data (RLData500)

## unique ids
uniqueID <- identity.RLdata500
head(uniqueID)
[1]  34  51 115 189  72 142

Synthetic data (RLData500)

(clusterSize <- table(table(uniqueID)))

  1   2 
400  50 
barplot(clusterSize, xlab="cluster size", ylab="number of records in each cluster", col="blue", width=c(2,2))

plot of chunk unnamed-chunk-3

Survey on Household Income and Wealth (SHIW)

  • Survey conducted by the Bank of Italy every two years.
  • Use the 2008 database and 2010 data sets from the Fruili region, which consist of 789 records
  • There are no duplicates within each data set (verify this on your own).
  • There are duplicates across the two data sets.
  • The features we have available: gender, year of birth, quality of life, etc (all ordinal variables).

Survey on Household Income and Wealth (SHIW)

fDat08 <- read.table("../../datasets/SHIW/v1_08_r6.txt",header=TRUE)
fDat10 <- read.table("../../datasets/SHIW/v2_10_r6.txt",header=TRUE)
dim(fDat08)
[1] 434   6
dim(fDat10)
[1] 355   6

Survey on Household Income and Wealth (SHIW)

## Let's combine the two data sets
fDat <- rbind(fDat08, fDat10)
head(fDat)
       id SEX ANASC STUDIO QUAL SETT
1 2160221   1  1941      5    6    5
2  222511   1  1928      3    6    5
3  222621   1  1941      5    6    5
4  222631   2  1931      3    6    5
5  222632   1  1960      4    1    2
6  222661   1  1926      3    6    5
## Let's keep the unique id separate.
funiqueID <- fDat$id
fDat <- fDat[,-1]

Survey on Household Income and Wealth (SHIW)

(fclusterSize <- table(table(funiqueID)))

  1   2 
385 202 
barplot(fclusterSize, xlab="cluster size", ylab="number of records in each cluster", col="red", width=c(2,2))

plot of chunk unnamed-chunk-6

How does exact matching work on SHIW?

# Let's just work with the first 10 records
ital <- fDat[1:10,]
# Let's initialize the variables we need
true.link <- exact.match <- near.twin <- rep(F,choose(dim(ital)[1],2))
for(i in 1:(dim(ital)[1]-1)){
    for(j in (i+1):dim(ital)[1]){
        rec.i <- ital[i,]
        rec.j <- ital[j,]
        if(rec.i[1]==rec.j[1]){true.link[i] <- T}
        if(all(rec.i[-1]==rec.j[-1])){exact.match[i] <- T}
        if(sum(rec.i[-1]!=rec.j[-1])<=1){near.twin[i] <- T}
    }
}
  1. Let's calculated the FNR and FDR for exact and near twin matching.
  2. Think about how to generalize the code for any data sets
  3. How would you also speed up the code? (Get rid of the for loops).

How does exact matching work on SHIW? (continued)

(exact.match.fnr <- sum(true.link & !exact.match)/sum(true.link))
[1] 0.875
(exact.match.fdr <- sum(exact.match & !true.link)/sum(exact.match))
[1] 0
(near.twin.fnr <- sum(true.link & !near.twin)/sum(true.link))
[1] 0.25
(near.twin.fdr <- sum(near.twin & !true.link)/sum(near.twin))
[1] 0.1428571

Task 1

  • Implement exact matching
  • Implement off by one matching (near twin matching)
  • Implement off by two matching
  • Make this into a function called ExactMatcher
  • Make sure nothing is hard coded into the function but use the example from class to help you!

Task 2

  • In a separate file Evaluate your results on RLdata500 and SHIW, call these files evaluteRLdata500 and evaluationSHIW, respectively.
  • Think about parts of the code that you can write into functions when possible.
  • How do you know that your algorithms are performing well?
  • What is the computational run time of your programs?