Intro to Record Linkage

Rebecca C. Steorts
August 30, 2016

Official Statistics

How do we cope with duplicated administrative records?

Precision Medicine

How do we cope with duplicated medical records?

Syrian Conflict

How do we cope with duplicated deaths in the Syrian conflict?

Syrian Conflict

  • Data from four human rights groups in Syria, comprising approximately 300,000 death records.
  • Information on full Arabic name, date of death (DoD), governorate, and gender.
  • There is some knowledge of ground truth from a Syrian hand matcher. (What issues could arise here?)
  • Data collection is on the ground.
  • There is a great deal of missing or unreliable information that is not used: civilian or military, age, lower-level location, etc.

Record Linkage

  • Record linkage joins multiple databases (without unique identifiers) to remove duplicated entities.

  • Record linkage is also known as entity resolution or coreference resolution.

Record Linkage versus De-duplication

  1. Record linkage: merging more than one database to remove duplicated entities.

  2. De-duplication: removing duplicated entities from one database.

  • De-duplication is a special case of record linkage.
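
The distinction above can be sketched in a few lines of Python. This is a minimal illustration, assuming exact agreement on every field is the matching rule; the records, field layout, and function names (`deduplicate`, `link`) are made up for the example and do not come from any real data set.

```python
from itertools import combinations, product

# Hypothetical toy records (name, date of birth); invented for illustration.
list_a = [("Jane Doe", "1990-01-01"), ("John Smith", "1985-06-12"), ("Jane Doe", "1990-01-01")]
list_b = [("John Smith", "1985-06-12"), ("Mary Jones", "1990-03-30")]

def deduplicate(records):
    """Index pairs WITHIN one database whose records agree exactly."""
    return [(i, j)
            for (i, r), (j, s) in combinations(enumerate(records), 2)
            if r == s]

def link(records_a, records_b):
    """Index pairs ACROSS two databases whose records agree exactly."""
    return [(i, j)
            for (i, r), (j, s) in product(enumerate(records_a), enumerate(records_b))
            if r == s]

within_a = deduplicate(list_a)   # duplicates inside list_a
across = link(list_a, list_b)    # matches between list_a and list_b
print(within_a)
print(across)
```

De-duplication compares a database with itself, while linkage compares records across databases; real methods replace the exact-equality test with something more forgiving.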

Big Questions in Record Linkage

  1. Given the data at hand and questions of interest, what models should we use?
  2. How can we make record linkage computationally scalable?
  3. What are principled ways of comparing features in a data set? Think about differences between languages (English, Chinese, Arabic).
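
To see why scalability (question 2) matters, consider how many record pairs a naive all-to-all comparison would need. The sketch below assumes every pair of records in a single file is compared once; the function name `n_comparisons` is made up for this example, and 300,000 is the approximate size of the Syrian data mentioned earlier.

```python
from math import comb

def n_comparisons(n_records: int) -> int:
    """Number of unordered record pairs: n * (n - 1) / 2."""
    return comb(n_records, 2)

n = 300_000
total = n_comparisons(n)
print(total)  # 44999850000
```

Roughly 45 billion comparisons for one moderately sized file is one reason methods that avoid all-to-all comparison are needed.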

United States Patent and Trademark Office

Which inventor records from the US Patent & Trademark Office (USPTO) database correspond to the same unique individuals?

USPTO: 8 million patents, multiple inventors per patent

Human Rights Data

  1. Homicide Victims in Colombia
    • three different homicide record systems; they don’t agree on the number of deaths
    • Colombian Census Bureau; Colombian National Police; Colombian Forensics Institute
    • issues with conceptual, methodological differences; geographical coverage
  2. Casualties in the Syrian conflict
    • four different lists
    • data collection on the ground
    • lots of missing data: gender, age, civilian vs military

What features/variables would be useful?

  1. How can we work with text?
  2. What is Exact Matching?
  3. What is Partial Matching?
  4. What kinds of errors might we expect to have?
  5. How will matching improve with the number and quality of features?
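
As one possible answer to questions 2 and 3, the sketch below contrasts exact matching with a simple form of partial matching on text. It assumes Levenshtein edit distance as the string comparison and a tolerance of one edit; both are illustrative choices, not the course's prescribed method, and the function names are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def exact_match(a: str, b: str) -> bool:
    """Two strings match only if they are identical."""
    return a == b

def partial_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Two strings match if they differ by at most max_edits edits
    (max_edits=1 is an assumed, illustrative threshold)."""
    return edit_distance(a, b) <= max_edits

print(exact_match("Jonathan", "Jonathon"))    # False
print(partial_match("Jonathan", "Jonathon"))  # True
```

Exact matching misses the misspelled name, while partial matching tolerates a single typo; the cost is that a looser threshold can also link records that are truly different.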

Supervised versus Unsupervised methods

The literature is mainly divided into two classes: supervised and unsupervised methods.

  1. Labeled data:
    • Pro: can build supervised models; ideas?
    • Con: expensive; how would we get labeled data?
  2. Unlabeled data:
    • Pro: cheap, easy to scrape, parse, collect public data
    • Con: what model should we use, how do we analyze this? What kinds of statistical analyses can we do?

What will this class cover?

The goals of the course are:

  1. Understanding the fundamentals of record linkage.
  2. Reading major papers in record linkage.
    • You present them and give summaries of them.
  3. Implementing code through tasks (see syllabus).
  4. Working on a group project that will be due at the end of the semester (this is optional).

Tentative topic list

  1. Introduction to record linkage.
  2. Simple record linkage methods.
    • Hand matching
    • Exact matching
    • Fellegi-Sunter method
    • Limitations of these methods
  3. Blocking methods.
    • Deterministic blocking
    • Probabilistic blocking
  4. Evaluation methods for record linkage and blocking.
  5. Clustering based methods (supervised and unsupervised).
  6. Bayesian methods.
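
As a preview of topic 3, here is a minimal sketch of deterministic blocking: records are grouped by a blocking key and only records within the same block are compared. The toy records, the choice of birth year as the key, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; invented for illustration.
records = [
    {"id": 0, "name": "Jane Doe", "year": 1990},
    {"id": 1, "name": "Jane Do", "year": 1990},
    {"id": 2, "name": "John Smith", "year": 1985},
    {"id": 3, "name": "Jon Smith", "year": 1985},
    {"id": 4, "name": "Mary Jones", "year": 1990},
]

def blocking_key(record):
    """Assumed blocking rule: records can only match if the year agrees."""
    return record["year"]

def candidate_pairs(records, key):
    """Return id pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))  # compare only within a block
    return pairs

pairs = candidate_pairs(records, blocking_key)
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs)
print(len(pairs), "of", all_pairs, "possible pairs")
```

Blocking reduces the number of comparisons, but any true match whose key is recorded with an error (for example, a wrong year) lands in a different block and can no longer be found.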

Assignments

  • You will be responsible for presenting papers over the course of the semester.
  • All presentations should use slides and give the class a summary of the main idea of the paper. This is all individual work.
  • Each week there will be programming tasks assigned (see syllabus).
  • There are optional end-of-course projects, which are group based. These are intended for those interested in pursuing record linkage beyond the class or doing research in this area.

Projects (optional)

  • You can work in groups (no larger than 2).
  • Write a one- to two-page summary by September 30, 2016 at 5 PM (ET) describing:
    • The data you plan to analyze. Please describe it in detail and provide references.
    • The motivating questions for looking at this data set.
    • Your initial thoughts for attacking this problem.
  • Please assign one group leader.
  • You should keep well-documented code and notes, and you will turn in an 8-page paper on your project at the end of the semester.
  • Your initial proposal should be no longer than 2 pages.
  • Please turn this in to Professor Steorts by email.