Homepage of Rebecca C. Steorts

Record Linkage and Privacy-Preserving Methods for Big Data

Collaborative Research, National Science Foundation, Penn State University and Duke University.

Co-PIs: Alexandra Slavkovic (PSU) and Rebecca C. Steorts (Duke)

Students: Bai Li (Duke)

Collaborators: Brenda Betancourt (Duke) , Matt Barnes (CMU), Jeff Miller (Harvard Biostat), Willie Neisweinger (CMU) , Hanna Wallach (Microsoft Research), Abbas Zaidi (Duke)

Summary of Work

We propose improvements in record linkage and privacy using state- of-the-art techniques both from statistics and machine learning, focusing on three themes. First, we develop novel methods for privacy-preserving record linkage (PPRL), making them probabilistic in nature using generative Bayesian methods. We refer to this as Bayesian privacy-preserving record linkage (BPPRL), where the record linkage model has a clustering mechanism for sparse data. We utilize what we call a small clustering framework, which assumes that the the number of records grows sub-linearly with the data. We can then perform PPRL for k databases and propagate uncertainty of both linkage and privacy preservation throughout the PPRL protocol. Second, we propose to develop way of releasing synthetic data post-linkage, with the guarantee of differential privacy and its relaxations. Finally, we propose ``big data" methods for both BPPRL and general privacy problems building on the sparse clustering framework and variational Bayesian methods.

Microclustering

We investigated models for latents entities that are non-exchanageable and apply in record linkage applications. In terms of the latent entities, it is well known that for record linkage, network problems, and other \emph{small clustering problems} while the number of records (or nodes) grows, the \emph{number} of latent entities remains fixed. This implies that the latent entities or clusters are not exchangeable, and hence, popularized methods such as the Kingman paintbox and Dirichlet mixture models are not appropriate distributions on the latent entities (see Wallach et al. (2010) and Broderick and Steorts (2014)). Given this observation, we build methodology and did empirical checks (prior predictive checks), to understand how the proposed model in Miller, Betancourt, Zaidi, Wallach, and Steorts (2015) compares to currently used Bayesian nonparametric techniques, namely the Pitman-Yor and Dirchlet Process. Prior predictive checks were done using real data from official statistics and simulated data. This work can be found at http://arxiv.org/abs/1512.00792

Releasing synthetic categorical data that is differential private based upon generalized hash functions

We have investigated two existing algorithms release synthetic data, as sparse histograms. Each algorithm satisfies differential privacy, however, we look at the tradeoffs of data utility, sensitivity of tuning parameters, as well as computational scalability. In terms of computational scalability, we investigate the use of hash functions to compress large contingency tables (or histograms) into a low dimensional object or structure.

Performance Bounds for Graphical Record Linkage

Record linkage involves identifying duplicate records in large, noisy databases. Traditional linkage methods directly linking records to one another are computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat linkage as a clustering task, in which each latent entity is associated with one or more noisy database records, and the inferential goal is to identify the latent entity underlying each observed database record. We investigate and critically assess performance bounds using the Kullback-Leibler (KL) divergence under a general record linkage framework. We derive an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insight for when our bounds hold using simulated data.