STA 794: Modern Advancements in Record Linkage

Rebecca C. Steorts
Duke University, Assistant Professor of Statistical Science and Computer Science
Contact: beka@ sloth dot gator dot edu

In the contact, replace sloth and gator with the appropriate words.


Information about social entities is very often scattered across multiple databases. Combining that information into one database can bring enormous benefits for analysis, yielding richer and more reliable conclusions. Questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in Syria?

In most practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because the identifiers are not part of some databases or because they are not available due to privacy concerns. In such cases, analysts need to use methods from statistical and computational science known as {\em record linkage} (also called {\em entity resolution} or {\em de-duplication}) to proceed with analysis. Record linkage is not only a crucial task for social science and industrial applications, but is also a challenging statistical and computational problem in itself, because many databases contain errors (noise, lies, omissions, duplications, etc.), and the number of parameters to be estimated grows with the number of records.

The objective of this course is to provide an introduction to record linkage methodology and computational tools. This will be achieved through paper readings, lectures, group discussions, and student-led discussions of papers that are assigned weekly. Students will have the opportunity to complete computational (coding) tasks to deepen their understanding of record linkage and how it is useful on both synthetic and real data sets.

Please see the course syllabus for more information: Syllabus


Course Topics

  • Week 1 (August 30): An introduction to Record Linkage

  • Week 2 (September 6): The Fellegi-Sunter Method

  • Week 3 (September 13): Record Linkage Workshop held in Cambridge, UK.

  • Week 4 and Week 5 (September 20 and 27): Presentations from INI meeting (one per student). Please note that Cambridge is 5 hours ahead of US Eastern Time.

  • Week 6 (October 4): Blocking methods

  • Week 7 (October 11): No class (fall break).

  • Week 8 (October 18): Supervised Record Linkage: Random Forests, Forest of Random Forests, and Bayesian Forests.
  • Week 9 (October 25): Bayesian Graphical Record Linkage
  • Week 10 (November 1): Bayesian Graphical Record Linkage
  • Week 11 (November 9): Bayesian Graphical Record Linkage


    Record Linkage Readings

  • Assigned Readings

    Blocking Readings

  • Assigned Readings

    Optional Readings

    Homework Assignments

    Datasets and Public Packages

    For public data sets and packages that can be used for future homework assignments, please see the public repository at RECLINK Toolbox