Very often, information about social entities is scattered across multiple databases. Combining that information into a single database can yield enormous benefits for analysis, leading to richer and more reliable conclusions. Questions that have been, and can be, addressed by combining information include: How accurate are census enumerations for minority groups? How many of the elderly are at high risk for sepsis in different parts of the country? How many people were victims of war crimes in recent conflicts in Syria?
In most practical applications, however, analysts cannot simply link records across databases using unique identifiers, such as social security numbers, either because the identifiers are missing from some databases or are withheld due to privacy concerns. In such cases, analysts must turn to methods from statistical and computational science known as {\em record linkage} (also called {\em entity resolution} or {\em de-duplication}) to proceed with analysis. Record linkage is not only a crucial task for social science and industrial applications, but also a challenging statistical and computational problem in its own right: many databases contain errors (noise, lies, omissions, duplications, etc.), and the number of parameters to be estimated grows with the number of records.
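To make the core idea concrete, here is a minimal sketch (not a course assignment, and not any particular method covered in the course) of linking two toy databases that share no unique identifier: candidate record pairs are scored by string similarity on a noisy name field, and pairs above a threshold are declared links. The record values and the 0.85 threshold are illustrative assumptions.

```python
# Minimal record-linkage sketch: score cross-database pairs by string
# similarity when no shared unique identifier exists. All data and the
# threshold below are made up for illustration.
from difflib import SequenceMatcher

# Two small "databases" with no common ID; one name contains a typo.
db_a = [{"id": "a1", "name": "Jonathan Smith"},
        {"id": "a2", "name": "Maria Garcia"}]
db_b = [{"id": "b1", "name": "Jonathon Smith"},   # misspelled duplicate
        {"id": "b2", "name": "Maria Garcia"},
        {"id": "b3", "name": "Wei Chen"}]

def similarity(x, y):
    """Normalized edit-based similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def link(left, right, threshold=0.85):
    """Declare a link for every cross-database pair scoring above threshold."""
    links = []
    for ra in left:
        for rb in right:
            score = similarity(ra["name"], rb["name"])
            if score >= threshold:
                links.append((ra["id"], rb["id"], round(score, 2)))
    return links

links = link(db_a, db_b)
# The misspelled "Jonathon Smith" still links to "Jonathan Smith".
```

Note that this all-pairs comparison is quadratic in the number of records, which is one reason record linkage is computationally challenging at scale; the errors in real databases (noise, duplications, omissions) make the statistical side equally hard.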
The objective of this course is to provide an introduction to record linkage methodology and computational tools. This will be achieved through lectures, weekly assigned readings, group discussions, and student-led discussions of papers. Students will complete computational (coding) tasks on both synthetic and real data sets to deepen their understanding of record linkage and its usefulness.
Please see the course syllabus for more information: Syllabus
Talks will be assigned to students in class based on their interest in each topic. Please start looking through the available talks.