I conduct both applied and methodological research in statistical science. I am most
interested in applications involving social science and public policy, although I enjoy working with researchers in all
disciplines. My methodological research focuses mainly on statistical methods for protecting data
confidentiality, for handling missing data, for combining information from multiple data sources, and for modeling complex
data, including methods for causal inference. Most often, but not always, I derive methodology from Bayesian
statistical principles.
Research on Statistical Disclosure Limitation
Many agencies collect data that they intend to share with others.
However, releasing the data as collected might
reveal data subjects' identities or sensitive attributes. Simple
actions like stripping names and addresses may not sufficiently
protect confidentiality when the data contain other variables, such as
demographic variables or employment/education histories, that
ill-intentioned users can match to external data files. Thus, agencies need
sophisticated methods to facilitate safe data sharing and dissemination. I work extensively
on theories and applications of such methods, including (i) synthetic
data, in which original data values at high risk of disclosure are replaced with values
simulated from probability distributions specified to reproduce as many
of the relationships in the original data as possible, (ii) metrics that agencies
can use to quantify risks of disclosure and the potential loss of information due to confidentiality protection
procedures, including both formal privacy and Bayesian probabilities of disclosure, and (iii) methods of secure data analysis that offer users output from
statistical models without allowing them to see the actual data. I am particularly interested in
developing applications for government databases; for example, the U.S. Census Bureau disseminates synthetic data
for the Survey of Income and Program Participation, the American Community Survey group quarters data, the OnTheMap program, and the Longitudinal Business Database.
I have advocated for an integrated system for providing data access comprising (i) synthetic data released as public use files, (ii) remote access to confidential data for vetted users, and (iii) a verification system that allows users to receive feedback on the similarity of analyses based on the synthetic and confidential data.
Variants of this approach are being developed by, for example, the Census Bureau for the American Community Survey, the Internal Revenue Service for a synthetic tax filings database, and the National Center for Health Statistics for several of its key data products.
This approach also was highlighted by the Ryan-Murray Commission on Evidence-Based Policymaking as a way forward for providing access to confidential data. Additionally, in 2015,
The Atlantic published
a story about my research on methods for protecting data confidentiality.
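To make the synthetic data idea concrete, below is a minimal sketch in Python. It assumes a single continuous variable (an illustrative "income") has been deemed high risk and replaces it with draws from the predictive distribution of a simple regression on the non-sensitive variables; the variable names and model are hypothetical, and practical synthesizers reproduce much richer multivariate structure and propagate parameter uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2015)

def synthesize_column(X, y, n_copies=5, rng=rng):
    """Replace a sensitive column y with synthetic draws.

    Fits an ordinary least-squares regression of y on the non-sensitive
    predictors X, then simulates replacement values from the estimated
    predictive distribution. A fully Bayesian synthesizer would also draw
    the coefficients and variance from their posterior; this sketch
    conditions on the point estimates for brevity.
    """
    X1 = np.column_stack([np.ones(len(X)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # point estimates
    sigma = (y - X1 @ beta).std(ddof=X1.shape[1])     # residual scale
    # One synthetic version of y per copy, mirroring multiple synthetic datasets.
    return [rng.normal(X1 @ beta, sigma) for _ in range(n_copies)]

# Hypothetical data: income is sensitive; age and education are not.
n = 1000
age = rng.uniform(18, 80, n)
educ = rng.integers(8, 21, n)
income = 300 * age + 2000 * educ + rng.normal(0, 5000, n)

synthetic_incomes = synthesize_column(np.column_stack([age, educ]), income)
```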
Research on Missing Data
A major thrust of my research has been to extend the theory and
applications of multiple imputation, which was first conceived as a tool
that statistical agencies could use to handle nonresponse in large
datasets that are disseminated to the public. The basic idea is for the statistical
agency to simulate values for the missing data repeatedly by sampling
from predictive distributions of the missing values. This creates multiple
completed datasets that are disseminated to the public.
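As a schematic illustration (not the production routines agencies use), the Python sketch below imputes a single incomplete variable from a regression fit to the observed cases and then pools estimates across the completed datasets with Rubin's combining rules. The data and model are hypothetical, and a fuller implementation would also draw the regression parameters from their posterior before each imputation.

```python
import numpy as np

rng = np.random.default_rng(7)

def multiply_impute(x, y, m=5, rng=rng):
    """Return m completed copies of y, imputing its missing entries.

    Missing values are drawn from the predictive distribution of a simple
    regression of y on x fit to the observed cases; a production imputer
    would also propagate parameter uncertainty.
    """
    obs = ~np.isnan(y)
    X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
    sigma = (y[obs] - X_obs @ beta).std(ddof=2)
    X_mis = np.column_stack([np.ones((~obs).sum()), x[~obs]])
    completed = []
    for _ in range(m):
        y_filled = y.copy()
        y_filled[~obs] = rng.normal(X_mis @ beta, sigma)   # stochastic draws
        completed.append(y_filled)
    return completed

def rubin_combine(estimates, variances):
    """Pool point estimates and variances across the m completed datasets."""
    q_bar = np.mean(estimates)            # pooled estimate
    u_bar = np.mean(variances)            # within-imputation variance
    b = np.var(estimates, ddof=1)         # between-imputation variance
    return q_bar, u_bar + (1 + 1 / len(estimates)) * b

# Example: estimate the mean of y after multiple imputation.
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
y[rng.random(200) < 0.3] = np.nan         # 30% missing at random
copies = multiply_impute(x, y)
ests = [c.mean() for c in copies]
vars_ = [c.var(ddof=1) / len(c) for c in copies]
print(rubin_combine(ests, vars_))
```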
I have developed multiple imputation approaches for correcting erroneous or implausible values due to measurement errors. Most recently, I have been developing methods that leverage auxiliary information on population distributions to improve imputation models.
Research on Data Integration
As the costs of mounting new data collection efforts increase, many statistical agencies and data analysts are turning to integrating
data from multiple sources. I develop methodology for record linkage, in which analysts attempt to merge records from two or more databases, typically based on
variables that are not unique and may contain error. I work on techniques that account for uncertainty due to inexact matching, especially propagating this uncertainty to downstream regression and causal inferences. I also have worked on data fusion, in which analysts combine information from two or more databases with disjoint sets of individuals and some variables that do not overlap.
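As a concrete illustration of the matching step in record linkage, the Python sketch below scores candidate record pairs with Fellegi-Sunter-style agreement weights. The linking fields and the m- and u-probabilities are hypothetical; in practice these probabilities are estimated (for example, with the EM algorithm), and my work focuses on carrying the resulting linkage uncertainty into downstream analyses rather than treating the highest-scoring links as fixed.

```python
import numpy as np

def match_weights(file_a, file_b, m_probs, u_probs):
    """Fellegi-Sunter-style log weights for every candidate record pair.

    file_a, file_b : lists of dicts sharing the same linking fields.
    m_probs, u_probs : per-field probabilities that the field agrees for
        true matches (m) and for non-matches (u). Treated as known here;
        in practice they are estimated, e.g., with the EM algorithm.
    """
    fields = list(m_probs)
    scores = np.zeros((len(file_a), len(file_b)))
    for i, rec_a in enumerate(file_a):
        for j, rec_b in enumerate(file_b):
            w = 0.0
            for f in fields:
                if rec_a[f] == rec_b[f]:
                    w += np.log(m_probs[f] / u_probs[f])               # agreement
                else:
                    w += np.log((1 - m_probs[f]) / (1 - u_probs[f]))   # disagreement
            scores[i, j] = w
    return scores

# Hypothetical files linked on noisy, non-unique fields.
file_a = [{"yob": 1980, "zip": "27708", "sex": "F"}]
file_b = [{"yob": 1980, "zip": "27705", "sex": "F"},
          {"yob": 1975, "zip": "27708", "sex": "M"}]
m_probs = {"yob": 0.95, "zip": 0.90, "sex": 0.98}
u_probs = {"yob": 0.05, "zip": 0.10, "sex": 0.50}
print(match_weights(file_a, file_b, m_probs, u_probs))
```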