Thom Volker

Utrecht University
Methodology and Statistics

Private yet accessible: advancing privacy-aware synthetization of sensitive microdata

Microdata held at National Statistical Institutes (NSIs) are a potential gold mine for answering many social science research questions. In recent years, several NSIs have successfully established programs to make microdata available to researchers (e.g., Green, 2021; Hundepool et al., 2012), resulting in novel and innovative substantive research (see, e.g., Hanushek et al., 2021; De Zeeuw et al., 2021). However, granting external researchers access to microdata is costly and time-consuming for both the NSI and the researcher, and carries disclosure risk. These barriers keep the research community from taking full advantage of these gold mines.

Such accessibility problems can be overcome by disseminating synthetic data, that is, a “simulated” dataset that is statistically indistinguishable from the original microdata but without disclosure risks (Rubin, 1993; Little, 1993). If the quality of the synthetic data is sufficiently high, researchers can use it to make inferences without ever needing access to the original microdata (Drechsler, 2011). Even if the synthetic data differs from the original, it serves many purposes, such as data exploration, model testing, education, and building data processing pipelines for (unit) tests and open science workflows (Van Kesteren, 2021). The main hurdle holding back practical implementation of data synthesis is the privacy-utility tradeoff (Drechsler, 2011; Burnett-Isaacs et al., 2021): the more realistic the synthetic data, the higher the disclosure risk may be. Although methods exist to create realistic synthetic datasets with high utility (e.g., Volker & Vink, 2021), the risk of disclosure is often unclear. Formal privacy quantification methodology for synthetic data is still in its infancy (Raghunathan, 2021; Bowen & Snoke, 2019), and user-friendly data synthesis software with privacy quantification does not exist.

The objective of the project is to fill these gaps. We concentrate on the research question: how can we balance the usefulness of synthetic microdata against the risk of disclosure? To answer this question, we define the following four projects:
Project 1: Differential privacy (Dwork, 2006; Oberski & Kreuter, 2020) is among the most popular approaches to quantify disclosure risk. The concept requires that adding or removing one person from a dataset changes the probability of any released output by at most a small multiplicative factor. If this holds, the disclosure risk is considered minor and acceptable. Recently, differential privacy has been linked to the synthetic data framework (Bowen & Liu, 2020; Liu, 2016), but practical implementations are lacking. We will advance the methodology and extend state-of-the-art synthesis software (i.e., mice, synthpop).
Project 2: Past research on statistical models for data synthesis predominantly focused on data utility, but largely neglected disclosure risks. We will empirically compare different synthesis models on the basis of both existing definitions of privacy (e.g., differential privacy, t-closeness) and utility (e.g., general and analysis-specific utility measures). This project will result in methodological guidelines for data synthesis, as well as improved default implementations in synthesis software.
Project 3: Currently, privacy and data utility are quantified only by separate measures. We will advance the synthetic data methodology by formalizing the privacy-utility tradeoff in a single measure. The quantification of this tradeoff can serve as a benchmark in the evaluation of synthesis procedures, such that for a fixed amount of disclosure risk an upper bound of the data utility can be determined.
Project 4: We will apply the results from the previous projects to an end-to-end case study. We will collaborate with Statistics Netherlands to solve practical statistical disclosure issues. In addition, we will identify and solve practical issues that arise when creating synthetic data, and carefully outline the methodology needed by those who will use the synthetic data.
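
To make the differential privacy notion in Project 1 concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is an illustration only, not the synthesis methodology of the project; the function names (laplace_noise, private_count), the toy ages, and the choice of epsilon are all hypothetical. A counting query changes by at most 1 when one person is added or removed, so Laplace noise with scale 1/epsilon bounds the change in the probability of any released value by a factor exp(epsilon).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one value from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the records that satisfy `predicate`.

    Assumption: adding or removing one person changes the true count by at
    most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of seven respondents.
rng = random.Random(2022)
ages = [23, 35, 41, 58, 62, 67, 70]

# Smaller epsilon means more noise: stronger privacy, lower utility.
noisy_count = private_count(ages, lambda age: age >= 60, epsilon=0.5, rng=rng)
print(noisy_count)
```

Lowering epsilon widens the noise and thereby trades utility for privacy, which is a small-scale instance of the privacy-utility tradeoff that the projects above aim to quantify for complete synthetic datasets.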

Erik-Jan van Kesteren, Peter-Paul de Wolf, Stef van Buuren

Financed by
Utrecht University, Department of Methodology and Statistics

2022 – 2026