Record Linkage and Privacy-Preserving Methods for Big Data

Collaborative Research, National Science Foundation, Penn State University and Duke University.

Co-PIs: Alexandra Slavkovic (PSU) and Rebecca C. Steorts (Duke)

Students: Bai Li (Duke)

Collaborators: Brenda Betancourt (Duke) , Matt Barnes (CMU), Jeff Miller (Harvard Biostat), Willie Neisweinger (CMU) , Hanna Wallach (Microsoft Research), Abbas Zaidi (Duke)

    Summary of Work

    Social entities and their patterns of behavior is a key topic in the social sciences. Growth of modern information infrastructure, ease of data collection and storage, and development of novel computational data analyses techniques have invigorated such research. Furthermore, many large data application in medical databases, genetics, official statistics, human rights violations, and others, often contain sensitive and confidential information about social entities and their relations. In order to understand some underlying information about society, we often merge such databases, which often contain duplicated entities. After databases are merged, the entities can be uniquely identified, causing questions regarding how to protect the privacy of individuals and confidential information. We seek to understand how to both merge databases to identify entities and structure but also preserve privacy after such a merging (or linkage) process. (The process of merging databases is known as record linkage, de-duplication, or entity resolution). The methods we advocate, allow for exact uncertainty to be assessed throughout either a linkage or privacy process, thus, allowing us to estimate the number of patients in a multiple medical databases with a standard error, but then also to provide a protection on the privacy of such individuals after the linkage process.

    We propose improvements in record linkage and privacy using state- of-the-art techniques both from statistics and machine learning, focusing on three themes. First, we develop novel methods for privacy-preserving record linkage (PPRL), making them probabilistic in nature using generative Bayesian methods. We refer to this as Bayesian privacy-preserving record linkage (BPPRL), where the record linkage model has a clustering mechanism for sparse data. We utilize what we call a small clustering framework, which assumes that the the number of records grows sub-linearly with the data. We can then perform PPRL for k databases and propagate uncertainty of both linkage and privacy preservation throughout the PPRL protocol. Second, we propose to develop way of releasing synthetic data post-linkage, with the guarantee of differential privacy and its relaxations. Finally, we propose ``big data" methods for both BPPRL and general privacy problems building on the sparse clustering framework and variational Bayesian methods.


    joint work with Jeff Miller, Brenda Betancourt, Abbas Zaidi (Duke University), Hanna Wallach (Microsoft Research), and Rebecca C. Steorts (Duke).

    We investigated models for latents entities that are non-exchanageable and apply in record linkage applications. In terms of the latent entities, it is well known that for record linkage, network problems, and other \emph{small clustering problems} while the number of records (or nodes) grows, the \emph{number} of latent entities remains fixed. This implies that the latent entities or clusters are not exchangeable, and hence, popularized methods such as the Kingman paintbox and Dirichlet mixture models are not appropriate distributions on the latent entities (see Wallach et al. (2010) and Broderick and Steorts (2014)). Given this observation, we build methodology and did empirical checks (prior predictive checks), to understand how the proposed model in Miller, Betancourt, Zaidi, Wallach, and Steorts (2015) compares to currently used Bayesian nonparametric techniques, namely the Pitman-Yor and Dirchlet Process. Prior predictive checks were done using real data from official statistics and simulated data. This work can be found at

    Releasing synthetic categorical data that is differential private based upon generalized hash functions

    joint work with Bai Li (Duke), Vishesh Karwa (CMU and Harvard), Alexandra Slavkovic (PSU), and Rebecca C. Steorts (Duke).

    We have investigated two existing algorithms release synthetic data, as sparse histograms. Each algorithm satisfies differential privacy, however, we look at the tradeoffs of data utility, sensitivity of tuning parameters, as well as computational scalability. In terms of computational scalability, we investigate the use of hash functions to compress large contingency tables (or histograms) into a low dimensional object or structure.

    Performance Bounds for Graphical Record Linkage

    joint work Matt Barnes, Willie Neiswanger (CMU), and Rebecca C. Steorts (Duke).

    Record linkage involves identifying duplicate records in large, noisy databases. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records, and the inferential goal is to identify the latent entity underlying each observed database record. We investigate and critically assess performance bounds using the Kullback-Leibler (KL) divergence under a general record linkage framework. We derive an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insight for when our bounds hold using simulated data.