Intro to Record Linkage Data Sets

Rebecca C. Steorts
August 30, 2016

Let's investigate what record linkage data sets look like in practice.

Data sets

• Synthetic data (RLData500)
• Italian Official statistics data (SHIW)
• Health care data from NC
• NC voter registration data
• Data from the Syrian conflict (available for potential project)

Synthetic data (RLData500)

• This is one database of 500 records with 10 percent duplicate records.
• This is a synthetic data set.
• There are unique ids available.
• Features: first name, last name, year, month, and day of birth
• Data set is in the Record Linkage package in R.
• There is a larger version of this data set called RLdata10000 (10,000 records instead of 5,000 records).

Synthetic data (RLData500)

## load packages
data(RLdata500)
## read in and remove missing values
myDat <- RLdata500[,-c(2,4)]
dim(myDat)

[1] 500   5

head(myDat)

  fname_c1 lname_c1   by bm bd
1  CARSTEN    MEIER 1949  7 22
2     GERD    BAUER 1968  7 27
3   ROBERT HARTMANN 1930  4 30
4   STEFAN    WOLFF 1957  9  2
5     RALF  KRUEGER 1966  1 13
6  JUERGEN   FRANKE 1929  7  4


Synthetic data (RLData500)

## unique ids
uniqueID <- identity.RLdata500

[1]  34  51 115 189  72 142


Synthetic data (RLData500)

(clusterSize <- table(table(uniqueID)))


1   2
400  50

barplot(clusterSize, xlab="cluster size", ylab="number of records in each cluster", col="blue", width=c(2,2))


Survey on Household Income and Wealth (SHIW)

• Survey conducted by the Bank of Italy every two years.
• Use the 2008 database and 2010 data sets from the Fruili region, which consist of 789 records
• There are no duplicates within each data set (verify this on your own).
• There are duplicates across the two data sets.
• The features we have available: gender, year of birth, quality of life, etc (all ordinal variables).

Survey on Household Income and Wealth (SHIW)

fDat08 <- read.table("../../datasets/SHIW/v1_08_r6.txt",header=TRUE)
dim(fDat08)

[1] 434   6

dim(fDat10)

[1] 355   6


Survey on Household Income and Wealth (SHIW)

## Let's combine the two data sets
fDat <- rbind(fDat08, fDat10)

       id SEX ANASC STUDIO QUAL SETT
1 2160221   1  1941      5    6    5
2  222511   1  1928      3    6    5
3  222621   1  1941      5    6    5
4  222631   2  1931      3    6    5
5  222632   1  1960      4    1    2
6  222661   1  1926      3    6    5

## Let's keep the unique id separate.
funiqueID <- fDat\$id
fDat <- fDat[,-1]


Survey on Household Income and Wealth (SHIW)

(fclusterSize <- table(table(funiqueID)))


1   2
385 202

barplot(fclusterSize, xlab="cluster size", ylab="number of records in each cluster", col="red", width=c(2,2))


How does exact matching work on SHIW?

# Let's just work with the first 10 records
ital <- fDat[1:10,]
# Let's initialize the variables we need
true.link <- exact.match <- near.twin <- rep(F,choose(dim(ital)[1],2))
for(i in 1:(dim(ital)[1]-1)){
for(j in (i+1):dim(ital)[1]){
rec.i <- ital[i,]
rec.j <- ital[j,]
if(all(rec.i[-1]==rec.j[-1])){exact.match[i] <- T}
if(sum(rec.i[-1]!=rec.j[-1])<=1){near.twin[i] <- T}
}
}

1. Let's calculated the FNR and FDR for exact and near twin matching.
2. Think about how to generalize the code for any data sets
3. How would you also speed up the code? (Get rid of the for loops).

How does exact matching work on SHIW? (continued)

(exact.match.fnr <- sum(true.link & !exact.match)/sum(true.link))

[1] 0.875

(exact.match.fdr <- sum(exact.match & !true.link)/sum(exact.match))

[1] 0

(near.twin.fnr <- sum(true.link & !near.twin)/sum(true.link))

[1] 0.25

(near.twin.fdr <- sum(near.twin & !true.link)/sum(near.twin))

[1] 0.1428571