In my research, I develop statistical methodologies that are useful for broad classes of applied problems. I also develop mathematical theory that advances the discipline of statistical science itself. Most often, but not always, I derive methodology from Bayesian statistical principles. My main areas of research include statistical disclosure limitation, missing data, data integration, and the analysis of complex data, including methods for causal inference.

Research on Statistical Disclosure Limitation
Many agencies collect data that they intend to share with others. However, releasing the data as collected might reveal data subjects' identities or sensitive attributes. Simple actions like stripping names and addresses may not sufficiently protect confidentiality when the data contain other variables, such as demographic variables or employment/education histories, that ill-intentioned users can match to external data files. Thus, agencies need sophisticated methods to facilitate safe data sharing and dissemination. I work extensively on the theory and applications of such methods, including (i) synthetic data, in which original data values at high risk of disclosure are replaced with values simulated from probability distributions specified to reproduce as many of the relationships in the original data as possible, (ii) metrics that agencies can use to quantify the risks of disclosure and the potential loss of information due to confidentiality protection procedures, and (iii) methods of secure data analysis that offer users output from statistical models without allowing them to see the actual data. I am particularly interested in developing applications for government databases; for example, the U.S. Census Bureau uses synthetic data to disseminate data from the Survey of Income and Program Participation, the American Community Survey group quarters data, the OnTheMap program, and the Longitudinal Business Database.
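
To give a concrete flavor of the synthetic data idea, the sketch below replaces a sensitive variable with draws from the posterior predictive distribution of a simple Bayesian linear regression. The data, the variable names (age, income), and the model are hypothetical placeholders chosen for illustration, not a description of any particular agency's procedure.

    import numpy as np

    # Illustrative sketch only: partially synthetic data for one sensitive variable.
    rng = np.random.default_rng(0)

    # Hypothetical "original" data: income (sensitive) depends on age (quasi-identifier).
    n = 500
    age = rng.uniform(20, 65, n)
    income = 20_000 + 900 * age + rng.normal(0, 8_000, n)

    # Fit income = X beta + eps with a noninformative prior and draw (sigma^2, beta)
    # from the resulting posterior distribution.
    X = np.column_stack([np.ones(n), age])
    beta_hat, *_ = np.linalg.lstsq(X, income, rcond=None)
    resid = income - X @ beta_hat
    df = n - X.shape[1]
    sigma2 = resid @ resid / rng.chisquare(df)                                 # sigma^2 | data
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))  # beta | sigma^2, data

    # Replace every income value with a draw from the posterior predictive distribution,
    # preserving the income-age relationship while masking the original values.
    synthetic_income = X @ beta + rng.normal(0, np.sqrt(sigma2), n)
    released_file = np.column_stack([age, synthetic_income])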

Research on Missing Data
A major thrust of my research has been to extend the theory and applications of multiple imputation (MI), which was first conceived as a tool that statistical agencies could use to handle nonresponse in large datasets disseminated to the public. The basic idea is for the statistical agency to simulate values for the missing data repeatedly by sampling from predictive distributions of the missing values, creating multiple completed datasets that can be released to the public. Recently, I have been working on multiple imputation methods for correcting erroneous or implausible values.
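
The sketch below illustrates the basic MI workflow on a toy example: missing values of a hypothetical variable y are imputed M times from an approximate posterior predictive distribution under a normal regression on a fully observed x, and the M estimates of the mean of y are combined with Rubin's rules. All data and modeling choices here are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: y is missing for roughly 30% of units, x is fully observed.
    n = 300
    x = rng.normal(0, 1, n)
    y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
    missing = rng.random(n) < 0.3

    def impute_once(rng):
        """Return one draw of the missing y values from an approximate posterior predictive."""
        X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
        y_obs = y[~missing]
        beta_hat, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)
        resid = y_obs - X_obs @ beta_hat
        sigma2 = resid @ resid / rng.chisquare(X_obs.shape[0] - X_obs.shape[1])
        beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X_obs.T @ X_obs))
        X_mis = np.column_stack([np.ones(missing.sum()), x[missing]])
        return X_mis @ beta + rng.normal(0, np.sqrt(sigma2), missing.sum())

    # Create M completed datasets, estimate the mean of y in each, then combine.
    M = 20
    estimates, within_vars = [], []
    for _ in range(M):
        y_completed = y.copy()
        y_completed[missing] = impute_once(rng)
        estimates.append(y_completed.mean())
        within_vars.append(y_completed.var(ddof=1) / n)

    q_bar = np.mean(estimates)                 # combined point estimate
    u_bar = np.mean(within_vars)               # average within-imputation variance
    b = np.var(estimates, ddof=1)              # between-imputation variance
    total_variance = u_bar + (1 + 1 / M) * b   # Rubin's combining rule for the variance
    print(q_bar, np.sqrt(total_variance))

Drawing the regression parameters, not just the residual noise, in each imputation is what allows the between-imputation variance to reflect genuine uncertainty about the missing values.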

Research on Data Integration
As the costs of mounting new data collection efforts increase, many statistical agencies and data analysts are turning to integrating data from multiple sources. I develop methodology for record linkage and for data fusion. In record linkage, analysts attempt to merge records from two or more databases, typically based on variables that are not unique and may contain errors. I work on techniques for accounting for the uncertainty due to inexact matching. In data fusion, analysts combine information from two or more databases with disjoint sets of individuals and some variables that do not overlap. I work on techniques for incorporating auxiliary information, such as data from online polls, to improve data fusion inferences.
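
As a stylized illustration of record linkage under inexact matching, the sketch below scores candidate record pairs with Fellegi-Sunter-style log-likelihood-ratio weights computed from assumed agreement probabilities. The two toy files, the comparison fields, the m- and u-probabilities, and the declaration threshold are all hypothetical.

    import numpy as np

    # Toy files: each record is (birth_year, 3-digit ZIP prefix, sex); none of these
    # fields is unique, and any of them may be recorded with error.
    file_a = [(1980, "277", "F"), (1975, "275", "M"), (1980, "277", "M")]
    file_b = [(1980, "277", "F"), (1976, "275", "M")]

    # Assumed m-probabilities P(field agrees | records refer to the same person) and
    # u-probabilities P(field agrees | records refer to different people).
    m = np.array([0.95, 0.90, 0.98])
    u = np.array([0.02, 0.10, 0.50])
    agree_wt = np.log(m / u)                    # weight added when a field agrees
    disagree_wt = np.log((1 - m) / (1 - u))     # weight added when a field disagrees

    def match_score(rec_a, rec_b):
        """Sum the log-likelihood-ratio weights over the field-by-field comparison vector."""
        gamma = np.array([a == b for a, b in zip(rec_a, rec_b)])
        return float(np.where(gamma, agree_wt, disagree_wt).sum())

    # Score every candidate pair; pairs above a chosen threshold are declared links.
    scores = {(i, j): match_score(ra, rb)
              for i, ra in enumerate(file_a)
              for j, rb in enumerate(file_b)}
    declared_links = {pair for pair, score in scores.items() if score > 3.0}
    print(declared_links)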