STA 794: Modern Advancements in Record Linkage

Rebecca C. Steorts
Duke University, Assistant Professor of Statistical Science and Computer Science
Contact: beka@ sloth dot gator dot edu

In the contact, replace sloth and gator with the appropriate words.

Description

Very often, information about social entities is scattered across multiple databases. Combining that information into one database can yield enormous benefits for analysis, leading to richer and more reliable conclusions. Questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in Syria?

In most practical applications, however, analysts cannot simply link records across databases using unique identifiers such as social security numbers, either because some databases do not contain them or because they are unavailable due to privacy concerns. In such cases, analysts need methods from statistical and computational science known as record linkage (also called entity resolution or de-duplication) to proceed with analysis. Record linkage is not only a crucial task for social science and industrial applications but also a challenging statistical and computational problem in its own right, because many databases contain errors (noise, lies, omissions, duplications, etc.) and the number of parameters to be estimated grows with the number of records.
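
As a small illustration of why exact matching breaks down on noisy data, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the field names, the averaged string similarity, and the 0.6 threshold are all hypothetical, and the rule shown is a naive stand-in rather than any method from the course readings.

```python
from difflib import SequenceMatcher

# Hypothetical toy records: no shared unique identifier, and the second
# database contains spelling variations and a missing field.
db_a = [
    {"name": "Maria Gonzalez", "birth_year": "1950", "city": "Durham"},
    {"name": "John Smith", "birth_year": "1982", "city": "Raleigh"},
]
db_b = [
    {"name": "Maria Gonzales", "birth_year": "1950", "city": "Durham"},
    {"name": "Jon Smith", "birth_year": "1982", "city": ""},
    {"name": "Alice Wong", "birth_year": "1975", "city": "Cary"},
]

FIELDS = ("name", "birth_year", "city")


def field_similarity(x, y):
    """Case-insensitive string similarity in [0, 1]; a missing value scores 0."""
    if not x or not y:
        return 0.0
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def record_similarity(r, s):
    """Average the field similarities (an illustrative choice, not a model)."""
    return sum(field_similarity(r[f], s[f]) for f in FIELDS) / len(FIELDS)


def exact_links(a, b):
    """Pairs (i, j) whose records agree exactly on every field."""
    return [
        (i, j)
        for i, r in enumerate(a)
        for j, s in enumerate(b)
        if all(r[f] == s[f] for f in FIELDS)
    ]


def fuzzy_links(a, b, threshold=0.6):
    """Pairs (i, j) whose average similarity meets the assumed threshold."""
    return [
        (i, j)
        for i, r in enumerate(a)
        for j, s in enumerate(b)
        if record_similarity(r, s) >= threshold
    ]


print(exact_links(db_a, db_b))  # [] : a single typo defeats exact matching
print(fuzzy_links(db_a, db_b))  # [(0, 0), (1, 1)]
```

Note that the fuzzy rule compares every pair of records, so its cost grows quadratically with database size, and the threshold is chosen by hand. Both limitations are part of what makes record linkage a hard statistical and computational problem.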

The objective of this course is to provide an introduction to record linkage methodology and computational tools. This will be achieved through assigned readings, lectures, group discussions, and student-led discussions of papers assigned weekly. Students will have the opportunity to complete computational (coding) tasks to deepen their understanding of record linkage and its usefulness on both synthetic and real data sets.

Please see the course syllabus for more information: Syllabus

Course Topics

Week 1 (August 30): An introduction to Record Linkage

Week 2 (September 6): The Fellegi-Sunter Method

Assigned papers: Fellegi and Sunter (1969), Murray (2015).
Presenter: Bai Li
Slides: An Introduction to Record Linkage
Summary:
Assigned Tasks:

Week 3 (September 13): Record Linkage Workshop held in Cambridge, UK.

All talks will be streamed live at https://www.newton.ac.uk/events/streaming.
Please pick one talk to attend (there will be three during class time).
Turn in a summary of the talk that you attend.
Be prepared to give a short summary of the talk (10 to 15 minutes) during the next class period, with or without slides.
There will also be a hackathon in the UK. More details to come.
The schedule of talks is available here: https://www.newton.ac.uk/event/dlaw02/timetable
Talks will be assigned to students in class based on their interest in each topic. Please start looking through the talks that are available.

Week 4 and Week 5 (September 20 and 27): Presentations from the INI meeting (one per student). Please note that Cambridge is 5 hours ahead of US Eastern Time.

Week 6 (October 4): Blocking methods

Assigned papers: Steorts, Ventura, Sadinle, Fienberg (2014), Sadosky, Shrivastava, Price, and Steorts (2015), and Mining of Massive Datasets (Chapter 3).

Chai presenting Steorts, Ventura, Sadinle, Fienberg (2014) paper.
Chai summary of Steorts et al (2014) paper.
Foster presenting Sadosky et al (2015) paper.
Foster summary of Sadosky et al (2015) paper.

Week 7 (October 11): No class (fall break).

Week 8 (October 18): Supervised Record Linkage: Random Forests, Forest of Random Forests, and Bayesian Forests.

Assigned papers: TBD.
Presenters:
Slides:
Summaries:
Assigned Tasks:

Week 9 (October 25): Bayesian Graphical Record Linkage

Assigned papers: Steorts, Hall and Fienberg (2014, 2016)
Presenters:
Slides:
Summaries:
Assigned Tasks:

Week 10 (November 1): Bayesian Graphical Record Linkage

Assigned papers: Steorts (2015)
Presenters:
Slides:
Summaries:
Assigned Tasks:

Week 11 (November 9): Bayesian Graphical Record Linkage

Assigned papers: Betancourt, Zanella, Wallach, Zaidi, and Steorts (2016), "Flexible Models for Microclustering with Applications to Entity Resolution," NIPS, to appear.
Presenters:
Slides:
Summaries:
Assigned Tasks:

Record Linkage Readings

Assigned Readings

Blocking Readings

Assigned Readings

Optional Readings

Homework Assignments

Datasets and Public Packages

For public data sets and packages that can be used in future homework assignments, please see the public repository: RECLINK Toolbox
