In my research, I develop statistical methodologies that are useful
for broad classes of applied problems. I also develop mathematical
theory that advances the discipline of statistical
science itself. Most often, but not always, I derive methodology from Bayesian
statistical principles. My main areas of research
include statistical disclosure limitation, missing data, data integration, and the analysis of
complex data, including methods for causal inference.
Research on Statistical Disclosure Limitation
Many agencies collect data that they intend to share with others.
However, releasing the data as collected might
reveal data subjects' identities or sensitive attributes. Simple
actions like stripping names and addresses may not sufficiently
protect confidentiality when the data contain other variables, such as
demographic variables or employment/education histories, that
ill-intentioned users can match to external data files. Thus, agencies need
sophisticated methods to facilitate safe data sharing and dissemination. I work extensively
on theories and applications of such methods, including (i) synthetic
data, in which original data values at high risk of disclosure are replaced with values
simulated from probability distributions specified to reproduce as many
of the relationships in the original data as possible, (ii) metrics that agencies
can use to quantify disclosure risks and the potential loss of information due to confidentiality protection
procedures, and (iii) methods of secure data analysis that offer users output from
statistical models without allowing them to see the actual data. I am particularly interested in
developing applications for government databases; for example, the U.S. Census Bureau uses synthetic data
to disseminate data from the Survey of Income and Program Participation, the American Community Survey group quarters data, the OnTheMap program, and the Longitudinal Business Database.
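To make the synthetic data idea in (i) concrete, here is a minimal sketch in Python. It assumes a toy setting that is not drawn from any agency's actual procedure: each record is a pair (x, y), the sensitive value y is modeled with a simple linear regression on x, and the function name `synthesize` and the argument `at_risk` are hypothetical. For brevity, the sketch plugs in point estimates of the regression parameters; a full synthesizer would typically also draw the parameters from their posterior distribution.

```python
import random
import statistics


def synthesize(records, at_risk, seed=0):
    """Replace y for records flagged in `at_risk` with simulated values.

    records: list of (x, y) pairs; at_risk: set of record indices.
    Illustrative assumption: y | x is normal with a linear mean.
    """
    rng = random.Random(seed)
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)

    # Least-squares fit of y on x using all original records.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in records) / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation (n - 2 degrees of freedom).
    resid = [y - (intercept + slope * x) for x, y in records]
    sigma = (sum(r * r for r in resid) / (len(records) - 2)) ** 0.5

    # Keep low-risk values as collected; simulate the high-risk ones.
    return [
        (x, rng.gauss(intercept + slope * x, sigma)) if i in at_risk else (x, y)
        for i, (x, y) in enumerate(records)
    ]
```

Because the simulated values come from a model fit to the original data, the relationship between x and y is approximately preserved, while the released y values for the flagged records are no longer the ones collected.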
Research on Missing Data
A major thrust of my research has been to extend the theory and
applications of multiple imputation (MI), which was first conceived as a tool
that statistical agencies could use to handle nonresponse in large
datasets that are disseminated to the public. The basic idea is for the statistical
agency to simulate values for the missing data repeatedly by sampling
from predictive distributions of the missing values. This creates multiple
completed datasets that are disseminated to the public.
Recently, I have been working on multiple imputation methods for correcting erroneous or implausible values.
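The basic multiple imputation idea described above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions that are mine, not a description of any agency's method: a single variable, values missing completely at random, and a normal model with a noninformative prior. The function names `multiply_impute` and `combine_means` are hypothetical. The second function applies the standard combining rules for multiply imputed data to the sample mean.

```python
import random
import statistics


def multiply_impute(values, m=5, seed=0):
    """Return m completed copies of `values`, where None marks a missing entry."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    n = len(observed)
    y_bar = statistics.mean(observed)
    s2 = statistics.variance(observed)

    completed = []
    for _ in range(m):
        # Draw parameters from their posterior under the assumed normal model:
        # sigma^2 ~ (n - 1) s^2 / chi^2_{n-1}, mu | sigma^2 ~ N(y_bar, sigma^2 / n).
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        mu = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        # Fill each missing entry with a draw from the predictive distribution.
        completed.append(
            [v if v is not None else rng.gauss(mu, sigma2 ** 0.5) for v in values]
        )
    return completed


def combine_means(completed):
    """Combine the m completed-data estimates of the mean."""
    m = len(completed)
    estimates = [statistics.mean(d) for d in completed]
    within = [statistics.variance(d) / len(d) for d in completed]
    q_bar = statistics.mean(estimates)   # point estimate
    u_bar = statistics.mean(within)      # average within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance
    return q_bar, u_bar + (1 + 1 / m) * b
```

An analyst would run the same complete-data analysis on each released dataset and then combine the results; the between-imputation term is what carries the extra uncertainty due to the missing values.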
Research on Data Integration
As costs of mounting new data collection efforts increase, many statistical agencies and data analysts are turning to integrating
data from multiple sources. I develop methodology for record linkage and for data fusion. In record linkage, analysts attempt to merge records from two or more databases, typically based on
variables that are not unique and may contain error. I work on techniques for accounting for uncertainty due to inexact matching.
In data fusion, analysts combine information from two or more databases with
disjoint sets of individuals and some variables that do not overlap. I work on techniques for incorporating auxiliary information, such as
data from online polls, to improve data fusion inferences.
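As a simple illustration of why record linkage on non-unique, error-prone variables is uncertain, the following Python sketch scores a candidate record pair with classical log-likelihood-ratio match weights. It is a textbook-style sketch, not the methodology developed in this research. The field names, the m- and u-probabilities (the chance that a field agrees for true matches and for non-matches, respectively), and the prior match rate are all made-up values for illustration.

```python
import math

# Hypothetical (m, u) probabilities per field: P(agree | match), P(agree | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "zip": (0.90, 0.05),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of log-likelihood ratios over fields, assuming independent fields."""
    weight = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            weight += math.log(m / u)
        else:
            weight += math.log((1 - m) / (1 - u))
    return weight


def match_probability(weight, prior=0.01):
    """Convert a match weight to a posterior match probability via Bayes' rule."""
    odds = prior / (1 - prior) * math.exp(weight)
    return odds / (1 + odds)
```

A pair that agrees on some fields but not others receives an intermediate probability rather than a definite match or non-match, and it is this uncertainty that downstream inferences should reflect.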