Research groups

NSF-Census Research Network at Duke (NCRN)
Metaknowledge Network
NSF: Record Linkage and Privacy Big Data Grant

Research Interests


I focus on recovering high-dimensional objects from degraded data and on determining the structure that underlies them. The methods I use include record linkage, locality-sensitive hashing, privacy-preserving record linkage, and small area estimation, with applications to medical studies, fMRI studies, human rights violations, and the estimation of poverty rates in hard-to-reach domains.

One main focus is record linkage (de-duplication, coreference resolution, etc.), where the goal is to merge many large (and possibly noisy) databases and remove duplicate entities. In my approach, the databases are merged by clustering records to hypothesized latent individuals, and duplicate records are then removed. The key insight in my work is to link records to latent individuals. This insight leads to new representations of the problem, new data structures, and new models for linkage that are both flexible and scalable. There are also interesting connections here with variational methods.

Another focus is reducing the space of all-to-all record comparisons in record linkage problems using locality-sensitive hashing (LSH), a family of dimension reduction techniques. Typically, in record linkage, we construct structured rules to form "blocks" or "partitions" that reduce the original comparison space. Recently, however, we have shown that LSH methods are powerful, fast, and reliable at forming such blocks, placing records in the same block if they are "similar enough."
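As a minimal illustration of LSH-based blocking, the sketch below uses MinHash over character 2-grams with banding. It is a generic textbook construction, not the specific method from this work; the function names, the shingle size, and the numbers of hashes and bands are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_hashes))


def lsh_blocks(records, num_hashes=20, bands=10):
    """Map each (band index, band signature) key to the record ids that share it.

    `records` is a dict of record id -> string (a hypothetical input format).
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(rec_id)
    return buckets


def candidate_pairs(buckets):
    """Pairs of record ids that fall in at least one common bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be passed to a detailed comparison step, instead of all pairs of records. Using more rows per band makes blocks stricter, while using more bands makes them more permissive.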

If we flip the record linkage problem around, we end up with all sorts of privacy issues. Suppose we link many databases containing medical records. How can we provide privacy guarantees post-linkage? The goal of privacy-preserving record linkage is to limit how much information is revealed about individual-level data by either the linkage or by subsequent analyses of the linked data. I am exploring the decision-theoretic effects of combining random differential privacy with model-based record linkage methods. Specifically, how does the utility of more accurate linkage of less distorted data trade off against the risk of violating different individuals' privacy?

Finally, another focus is driven by my thesis work in small area estimation, which deals with disaggregating surveys to small, noisy subgroups. To deal with this issue, we borrow strength from surrounding "areas," where areas could be geographical, demographic, etc. My main focus is developing methods for social science applications from a decision-theoretic point of view that incorporate clustering and spatial smoothing techniques from manifold learning.
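To make "borrowing strength" concrete, here is a minimal sketch of a standard shrinkage estimator of the kind used in area-level models. It is a generic illustration, not the decision-theoretic method described above. It assumes that each area's sampling variance is known and positive and that the between-area variance is given; the function and parameter names are hypothetical.

```python
def shrinkage_estimates(direct, sampling_var, between_var):
    """Shrink each area's direct estimate toward a pooled mean.

    direct       -- direct survey estimate for each area
    sampling_var -- known, positive sampling variance of each direct estimate
    between_var  -- assumed variance of the true area means (non-negative)
    """
    if between_var < 0:
        raise ValueError("between_var must be non-negative")

    # Precision-weighted pooled mean across all areas.
    weights = [1.0 / (between_var + d) for d in sampling_var]
    pooled = sum(w * y for w, y in zip(weights, direct)) / sum(weights)

    estimates = []
    for y, d in zip(direct, sampling_var):
        # gamma near 1: trust the area's own data; near 0: rely on the pool.
        gamma = between_var / (between_var + d)
        estimates.append(gamma * y + (1.0 - gamma) * pooled)
    return estimates
```

Areas with large sampling variance (small, noisy subgroups) are pulled strongly toward the pooled mean, while precisely measured areas stay close to their direct estimates.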