In my research, I develop statistical methodologies that are useful for broad classes of applied problems. I also develop mathematical theory that advances the discipline of statistical science itself. Most often, but not always, I derive methodology from Bayesian statistical principles. My main areas of research include statistical disclosure limitation, missing data, data integration, and the analysis of complex data including methods for causal inferences.

Research on Statistical Disclosure Limitation
Many agencies collect data that they intend to share with others. However, releasing the data as collected might reveal data subjects' identities or sensitive attributes. Simple actions like stripping names and addresses may not sufficiently protect confidentiality when the data contain other variables, such as demographic variables or employment/education histories, that ill-intentioned users can match to external data files. Thus, agencies need sophisticated methods to facilitate safe data sharing and dissemination. I work extensively on theories and applications of such methods, including (i) synthetic data in which original data values at high risk of disclosure are replaced with values simulated from probability distributions specified to reproduce as many of the relationships in the original data as possible, (ii) metrics that agencies can use to quantify risks of disclosures and potential loss of information due to confidentiality protection procedures, including both formal privacy and Bayesian probabilities of disclosure, and (iii) methods of secure data analysis that offer users output from statistical models without allowing them to see the actual data. I am particularly interested in developing applications for government databases; for example, the U.S. Census Bureau disseminates synthetic data for the Survey of Income and Program Participation, the American Community Survey group quarters data, the OnTheMap program, and the Longitudinal Business Database. I have advocated for an integrated system for providing data access, including (i) synthetic data as public use files coupled with (ii) remote access to confidential data for vetted users and glued together by (iii) a verification system that allows users to receive feedback on the similarity of analyses based on the synthetic and confidential data.
Variants of this approach are being developed by, for example, the Census Bureau for the American Community Survey, the Internal Revenue Service for a synthetic tax filings database, and the National Center for Health Statistics for several of its key data products. This approach also was highlighted by the Ryan-Murray Commission on Evidence-Based Policymaking as a way forward for providing access to confidential data.
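
The synthetic data idea in (i) above can be illustrated with a small sketch. It is only a toy example under stated assumptions, not the procedure any agency uses: the variable names `age` and `income`, the choice of a simple linear regression as the synthesis model, and the function name `synthesize` are all hypothetical. The sensitive variable is replaced with draws from a model fit to the original data, while the non-sensitive variable is released unchanged. A fuller approach would also draw the model parameters from their posterior distribution rather than fixing them at point estimates.

```python
import random
import statistics


def synthesize(records, seed=0):
    """Return a partially synthetic copy of `records`.

    Each record is a dict with hypothetical keys "age" (kept as collected)
    and "income" (treated as sensitive and replaced). Income is simulated
    from a normal linear regression on age fit to the original data.
    """
    rng = random.Random(seed)
    ages = [r["age"] for r in records]
    incomes = [r["income"] for r in records]

    # Ordinary least squares fit of income on age.
    age_bar = statistics.fmean(ages)
    inc_bar = statistics.fmean(incomes)
    sxx = sum((a - age_bar) ** 2 for a in ages)
    sxy = sum((a - age_bar) * (y - inc_bar) for a, y in zip(ages, incomes))
    slope = sxy / sxx
    intercept = inc_bar - slope * age_bar

    # Residual standard deviation (n - 2 degrees of freedom).
    residuals = [y - (intercept + slope * a) for a, y in zip(ages, incomes)]
    resid_sd = (sum(e ** 2 for e in residuals) / (len(records) - 2)) ** 0.5

    # Simulate a replacement income for every record; the originals are untouched.
    return [
        {"age": r["age"], "income": rng.gauss(intercept + slope * r["age"], resid_sd)}
        for r in records
    ]
```

Because the simulated incomes come from a model of the age-income relationship, analyses of that relationship on the synthetic file should resemble those on the original file, which is the property a verification system would let users check.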

Research on Missing Data
A major thrust of my research has been to extend the theory and applications of multiple imputation, which was first conceived as a tool that statistical agencies could use to handle nonresponse in large datasets that are disseminated to the public. The basic idea is for the statistical agency to simulate values for the missing data repeatedly by sampling from predictive distributions of the missing values. This creates multiple, completed datasets that are disseminated to the public. I have developed multiple imputation approaches for correcting erroneous or implausible values due to measurement errors. Most recently, I have been developing methods that leverage auxiliary information on population distributions to improve imputation models.
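
The basic idea of multiple imputation can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not a production method: it treats a single numeric variable, assumes the values are missing completely at random, and uses a normal predictive distribution. The function names are hypothetical. The combining step uses Rubin's standard rules for pooling estimates across completed datasets.

```python
import random
import statistics


def impute_once(values, rng):
    """Fill each None in `values` with a draw from a normal predictive distribution."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    # Draw the mean from its approximate sampling distribution so that
    # parameter uncertainty carries into the imputations (a crude stand-in
    # for a full posterior draw; the standard deviation is held fixed).
    mu_draw = rng.gauss(mean, sd / len(observed) ** 0.5)
    return [v if v is not None else rng.gauss(mu_draw, sd) for v in values]


def multiply_impute(values, m=5, seed=0):
    """Create m completed datasets by repeating the imputation step."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]


def combine_means(completed):
    """Pool the sample mean across completed datasets using Rubin's rules."""
    m = len(completed)
    estimates = [statistics.fmean(d) for d in completed]
    # Average within-imputation variance of the mean.
    within = statistics.fmean([statistics.variance(d) / len(d) for d in completed])
    # Between-imputation variance of the point estimates.
    between = statistics.variance(estimates)
    pooled_estimate = statistics.fmean(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance
```

An analyst who receives the completed datasets runs the same complete-data analysis on each one and pools the results; the between-imputation term is what carries the extra uncertainty due to the missing values.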

Research on Data Integration
As costs of mounting new data collection efforts increase, many statistical agencies and data analysts are turning to integrating data from multiple sources. I develop methodology for record linkage, in which analysts attempt to merge records from two or more databases, typically based on variables that are not unique and may contain errors. I work on techniques that account for uncertainty due to inexact matching, especially propagating this uncertainty to downstream regression and causal inferences. I also have worked on data fusion, in which analysts combine information from two or more databases with disjoint sets of individuals and some variables that do not overlap.
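
To make the record linkage problem concrete, here is a minimal sketch that links two files on fields that are not unique and may contain typographical errors. It is a deterministic toy, assuming string similarity averaged across fields and a hypothetical cutoff of 0.85; the field names and function names are also hypothetical. It does not model matching uncertainty, which is exactly the gap that probabilistic and Bayesian linkage methods are meant to address.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, fields):
    """Average field similarity between two records over the linking fields."""
    total = sum(field_similarity(str(rec_a[f]), str(rec_b[f])) for f in fields)
    return total / len(fields)


def link(file_a, file_b, fields, threshold=0.85):
    """For each record in file_a, return its best candidate in file_b.

    Returns a list of (index_in_a, index_in_b, score) for pairs whose score
    meets the threshold. Records with no good candidate are left unlinked.
    """
    if not file_b:
        return []
    links = []
    for i, a in enumerate(file_a):
        scored = [(match_score(a, b, fields), j) for j, b in enumerate(file_b)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            links.append((i, best_j, best_score))
    return links
```

Keeping the score alongside each linked pair shows how close the call was. A score just above the cutoff signals a link that could easily be wrong, and treating such links as certain is what can bias regression or causal estimates computed on the merged file.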