Ilaria Lunardelli

Department Methodology & Statistics
Tilburg School of Social and Behavioral Sciences
Tilburg University

Project
Combining estimates based on multiple datasets

For decades, survey samples have been considered the gold standard for producing reliable population estimates. In recent years, however, survey response rates have declined significantly (Luiten et al., 2020), reducing effective sample sizes and, consequently, the precision of estimates (Rogelberg et al., 2007). Moreover, conducting surveys has become increasingly costly and time-consuming compared to using existing non-probability data sources, such as administrative records and opt-in web surveys (Baker et al., 2013; Cornesse et al., 2020).
National Statistical Institutes (NSIs), including Statistics Netherlands, hold multiple datasets that describe the same phenomena. For instance, information on health may be found in survey data, hospital administrative records, and opt-in web surveys. Each of these sources has distinct advantages and limitations:

• Survey samples are generally representative (at least after appropriate reweighting) but often small, resulting in unbiased yet inefficient estimates.
• Administrative datasets are typically large but selective, and their variable definitions may differ from official statistical standards, introducing measurement error.
• Opt-in web surveys can be very large but tend to suffer from strong selectivity and measurement error, often more so than administrative data.

Individually, none of these data sources is sufficient for producing accurate and efficient population estimates. Combining them, however, makes it possible to exploit their complementary strengths: the low bias of survey samples and the low variance of large administrative and opt-in datasets. Recent research (e.g., Elliott & Valliant, 2017; Liu et al., 2023) has provided valuable methodological advances in this area, but further development is needed to extend these methods to broader and more complex real-world settings.
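To make the bias-variance trade-off concrete, the sketch below combines an unbiased but noisy probability-sample estimate with a precise but biased non-probability estimate, using the weight that minimises the mean squared error of the weighted average. It is an illustration only, not the method developed in this project. It assumes the two estimates are independent and that the bias of the non-probability estimate is known, which is rarely true in practice. All function names and numbers are hypothetical.

```python
def composite_weight(var_prob, var_nonprob, bias_nonprob):
    """Weight on the probability-sample estimate that minimises the MSE of
    w * est_prob + (1 - w) * est_nonprob, assuming independent estimates
    and an unbiased probability-sample estimate (illustrative assumptions)."""
    mse_nonprob = var_nonprob + bias_nonprob ** 2
    return mse_nonprob / (var_prob + mse_nonprob)


def composite_mse(w, var_prob, var_nonprob, bias_nonprob):
    """MSE of the weighted average under the same assumptions."""
    mse_nonprob = var_nonprob + bias_nonprob ** 2
    return w ** 2 * var_prob + (1 - w) ** 2 * mse_nonprob


def combine(est_prob, est_nonprob, w):
    """Weighted average of the two point estimates."""
    return w * est_prob + (1 - w) * est_nonprob


if __name__ == "__main__":
    # Hypothetical inputs: a small survey (high variance, no bias) and a
    # large opt-in sample (low variance, noticeable bias).
    var_prob, var_nonprob, bias_nonprob = 4.0, 0.25, 1.5
    w = composite_weight(var_prob, var_nonprob, bias_nonprob)
    print("weight on survey estimate:", w)
    print("MSE survey only:          ", var_prob)
    print("MSE opt-in only:          ", var_nonprob + bias_nonprob ** 2)
    print("MSE combined:             ", composite_mse(w, var_prob, var_nonprob, bias_nonprob))
```

Under these assumptions the combined estimate never has a larger mean squared error than either input. In real applications the bias has to be estimated from the data, which adds uncertainty that this sketch ignores.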
The aim of this PhD project is to develop, enhance, and evaluate methods for combining estimates derived from diverse data sources, each with its own quality characteristics, so that the combined estimates are more accurate, more efficient, and less biased than those from any single source.
Specifically, the project focuses on four interrelated objectives:

1. Correcting for selectivity when a target variable is observed in a non-probability sample, including cases where inclusion depends on the variable itself.
2. Combining estimates from non-probability and probability samples, extending existing frameworks to more general variable types and relaxing assumptions about sample representativeness.
3. Estimating relationships between target variables observed in different datasets, by advancing statistical matching techniques that account for minimal dataset overlap.
4. Exploring practical implementation, identifying the conditions under which the developed methods perform best and providing guidelines for their application in official statistics production.
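As a simple illustration of what correcting for selectivity can mean in the first objective, the sketch below applies post-stratification: stratum means from a selective sample are reweighted to known population shares. This is a standard textbook technique, shown only as a baseline and not as the project's method. It assumes that inclusion is unrelated to the target variable within each stratum, so it does not cover the harder case in which inclusion depends on the target variable itself. The function name and the example data are hypothetical.

```python
from collections import defaultdict


def poststratified_mean(values, strata, population_shares):
    """Estimate a population mean from a selective sample by weighting each
    stratum's sample mean with its known population share."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, s in zip(values, strata):
        sums[s] += y
        counts[s] += 1

    # Every population stratum must be observed at least once.
    empty = [s for s in population_shares if counts[s] == 0]
    if empty:
        raise ValueError(f"no sample units in strata: {empty}")

    return sum(share * sums[s] / counts[s]
               for s, share in population_shares.items())


if __name__ == "__main__":
    # Hypothetical opt-in sample in which young respondents are over-represented.
    values = [1, 1, 1, 0, 0, 1]
    strata = ["young", "young", "young", "young", "old", "old"]
    shares = {"young": 0.3, "old": 0.7}  # assumed known from a population register

    print("naive sample mean:   ", sum(values) / len(values))
    print("post-stratified mean:", poststratified_mean(values, strata, shares))
```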

This research is highly relevant for modern NSIs, which increasingly rely on integrating multiple imperfect data sources to produce high-quality and timely statistics. The developed methods will enable statistically sound data integration, correct for selectivity and measurement errors, and strengthen the accuracy and reliability of official statistics – key priorities for the future of NSIs.

This PhD project is also well aligned with the mission of IOPS: it addresses central methodological challenges such as selectivity and representativeness, and it will produce methods that are useful to the wider research community.

Supervisors
Prof. Dr. Ton de Waal
Dr. Lizbeth Burgos Ochoa
Dr. Sander Scholtus

Financed by
Statistics Netherlands

Period
1 October 2025 – 30 September 2029