Faculty of Social and Behavioral Sciences

Methodology and Statistics

Tilburg University

**Project**

**Correcting for selectivity in datasets**

Vast amounts of data from large samples do not guarantee that our conclusions are close to the truth. One reason, and probably the most important one, is that datasets can be selective: the distribution of the sample differs from that of the population because the data were not collected under a probability sampling scheme with known inclusion probabilities. Even under probability sampling, nonresponse can make the sample selective. If we apply statistical methods to a massive but selective dataset, we tend to be overconfident in a possibly biased estimate, and the resulting intervals may fail to cover the true value (Meng, 2018).

To illustrate how severe this problem can be, Meng (2018) expresses the bias of an estimated population total as the product of three factors: (1) data quality, the correlation between the target variable Y and the participation mechanism; (2) data quantity, the square root of (1-f)/f, where f = n/N, n is the size of the dataset, and N is the population size; and (3) problem difficulty, the square root of the population variance of Y. When the target variable Y and the participation mechanism are correlated, the estimated population total is biased, even for large datasets. This research aims to compare and improve existing methods for selectivity correction under different scenarios. The following four topics will be studied.

(1) Correction Methods Based on “Traditional” Statistics

(2) Correction Methods Based on Machine Learning

(3) Selecting Appropriate Auxiliary Variables

(4) Application to Real Data

The most promising approach will be applied to real data to demonstrate its practical use. We aim to provide guidelines for researchers dealing with a selective dataset and to help them gain intuition about the data at hand, for example by estimating the plausible range of the selection bias. We will also try to address other practical issues such as multiple target variables, under- or overcoverage, and measurement error.
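As a rough illustration of how a bias range could be obtained, the sketch below uses the three-factor decomposition of Meng (2018) described earlier. It is not the project's method. It assumes the identity in its sample-mean form, error = correlation × sqrt((1-f)/f) × standard deviation of Y, with all moments computed over the finite population; for a total estimated as N times the sample mean, the error would simply scale by N. The simulated population, the logistic participation mechanism, and the function names `meng_decomposition` and `bias_range` are hypothetical choices made for this example only.

```python
import math
import random


def meng_decomposition(y, r):
    """Return (actual_error, rho, quantity, difficulty) for a finite population.

    y: target values for all N population units
    r: 0/1 participation indicators for the same units
    """
    N = len(y)
    n = sum(r)
    f = n / N                                  # fraction of the population in the dataset
    mean_y = sum(y) / N
    # Population (divide-by-N) moments, as the identity requires.
    var_y = sum((v - mean_y) ** 2 for v in y) / N
    var_r = f * (1 - f)                        # variance of a 0/1 indicator
    cov_ry = sum((ri - f) * (yi - mean_y) for ri, yi in zip(r, y)) / N
    rho = cov_ry / math.sqrt(var_r * var_y)    # (1) data quality
    quantity = math.sqrt((1 - f) / f)          # (2) data quantity
    difficulty = math.sqrt(var_y)              # (3) problem difficulty
    sample_mean = sum(yi for ri, yi in zip(r, y) if ri) / n
    return sample_mean - mean_y, rho, quantity, difficulty


def bias_range(rho_low, rho_high, f, sigma_y):
    """Bias of the sample mean implied by an assumed range of correlations."""
    quantity = math.sqrt((1 - f) / f)
    return rho_low * quantity * sigma_y, rho_high * quantity * sigma_y


random.seed(1)
N = 20_000
y = [random.gauss(50, 10) for _ in range(N)]
# Hypothetical selective mechanism: units with larger Y participate more often.
r = [1 if random.random() < 1 / (1 + math.exp(-(yi - 50) / 10)) else 0 for yi in y]

error, rho, quantity, difficulty = meng_decomposition(y, r)
product = rho * quantity * difficulty

print(f"actual error of the sample mean: {error:.4f}")
print(f"product of the three factors:    {product:.4f}")
print("assumed bias range:", bias_range(0.0, 0.05, f=0.5, sigma_y=10))
```

In practice the correlation between Y and participation is unknown, so `bias_range` only turns an assumed interval for that correlation into an interval for the bias; choosing a defensible interval is the difficult part.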

**Supervisors**

prof. dr. A. G. de Waal, dr. K. van Deun

**Financed by**

**Period**