Project
Measuring the Quality of Big Data and Administrative Data
“Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” This is the central question of the article Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election by Professor Xiao-Li Meng. Beyond its academic relevance, the quality of estimates based on non-probabilistic data is currently a central concern for researchers and Official Statistics, given the growing prevalence of this type of data. For instance, national statistical institutes frequently use self-reported administrative datasets, which follow virtually the same sampling mechanisms as Big Data. Similarly, researchers in the social sciences increasingly use data from social networks (e.g., Twitter, Reddit) and other alternative sources.
A major difference between traditional survey data on the one hand and Big Data and administrative data on the other is that sample surveys are collected using a well-specified sampling design, in which each population unit is included in the sample with a known probability. In contrast, the inclusion probability of units is generally unknown in administrative and Big Data sources. This is an essential feature, since information about the sampling design allows one to correct for selectivity in the collected data, obtain accurate population estimates for quantities of interest, and assess the quality of these estimates. For administrative datasets and Big Data, often little is known about the mechanism that leads to selectivity in the collected data, which makes it very difficult to measure the effects of selectivity and to correct for them.
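The role of known inclusion probabilities can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated population, the function name compare_estimators, and the way inclusion probabilities rise with the outcome. It contrasts the unweighted sample mean with an inverse-probability-weighted (Hájek-type) mean, which is only available when the inclusion probabilities are known.

```python
import random


def compare_estimators(seed=0, n_pop=100_000):
    """Compare a naive and a design-weighted mean on a simulated population.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Hypothetical population values of the quantity of interest.
    y = [rng.gauss(50.0, 10.0) for _ in range(n_pop)]

    # Selective inclusion: units with larger y are more likely to be included.
    # Probabilities are clamped to stay strictly between 0 and 1.
    pi = [min(0.5, max(0.02, 0.10 + 0.005 * (v - 50.0))) for v in y]

    # Each unit is included independently with its own probability.
    sample = [i for i in range(n_pop) if rng.random() < pi[i]]

    true_mean = sum(y) / n_pop

    # Naive estimate: ignores the selection mechanism entirely.
    naive = sum(y[i] for i in sample) / len(sample)

    # Weighted estimate: each sampled unit is weighted by 1 / pi,
    # which requires the inclusion probabilities to be known.
    weights = [1.0 / pi[i] for i in sample]
    weighted = sum(w * y[i] for w, i in zip(weights, sample)) / sum(weights)

    return {"true": true_mean, "naive": naive, "weighted": weighted}


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.2f}")
```

In this setup the naive mean overstates the population mean, while the weighted mean stays close to it. With Big Data or administrative sources the list pi is typically not available, so the weighted estimate cannot be computed in this way.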
Estimates based on selective Big Data or administrative datasets may be severely biased, and the quality of these estimates is hard to assess. A well-known example is the 2016 US presidential election, where pre-election polls predicted a victory for Mrs. Clinton. However, these polls were selective because Trump supporters were underrepresented. At national statistical institutes, the situation can be even worse. For the 2016 US presidential election, we ultimately learned the results, whereas for data provided by a national statistical institute we may never learn the true values of the quantities of interest, so invalid estimates and conclusions based on these data can remain unchallenged for a long period.
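How large such a bias can become is captured by an identity from the Meng article cited earlier: the error of a sample mean equals the product of the correlation between the inclusion indicator and the outcome (rho), the factor sqrt((1 - f) / f) with f the fraction of the population covered, and the population standard deviation of the outcome (sigma). The sketch below checks this identity numerically. The population, the 80% coverage, the slight selectivity, and the function names are hypothetical choices for illustration only.

```python
import math
import random


def meng_decomposition(y, r):
    """Return (error, rho * sqrt((1 - f) / f) * sigma, rho, f).

    y: outcome for every population unit.
    r: inclusion indicator (1 if the unit is in the dataset, else 0).
    Variances and covariances use the finite-population form (divide by N).
    """
    n_pop = len(y)
    n = sum(r)
    f = n / n_pop  # fraction of the population covered

    mean_y = sum(y) / n_pop
    sample_mean = sum(yi for yi, ri in zip(y, r) if ri) / n

    sigma_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n_pop)
    cov_ry = sum((ri - f) * (yi - mean_y) for yi, ri in zip(y, r)) / n_pop
    rho = cov_ry / (math.sqrt(f * (1.0 - f)) * sigma_y)

    error = sample_mean - mean_y
    decomposition = rho * math.sqrt((1.0 - f) / f) * sigma_y
    return error, decomposition, rho, f


def demo(seed=1, n_pop=50_000):
    """Hypothetical binary outcome with roughly 80% coverage and mild selectivity."""
    rng = random.Random(seed)
    y = [1 if rng.random() < 0.5 else 0 for _ in range(n_pop)]
    # Units with y = 1 are slightly more likely to be covered than units with y = 0.
    r = [1 if rng.random() < (0.82 if yi else 0.78) else 0 for yi in y]
    return meng_decomposition(y, r)


if __name__ == "__main__":
    error, decomposition, rho, f = demo()
    print(f"coverage f:       {f:.3f}")
    print(f"correlation rho:  {rho:.4f}")
    print(f"actual error:     {error:.5f}")
    print(f"rho*sqrt((1-f)/f)*sigma: {decomposition:.5f}")
```

The two printed error values agree up to floating-point precision, since the identity is algebraic. In practice rho is unknown, because the outcome is not observed for units outside the dataset, which is one way to see why the quality of such estimates is hard to assess.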
The proposed research centers on measuring the quality of non-probabilistic data, specifically Big Data and administrative registers, which are increasingly accessible to social science researchers and the general public. Within the topic of data quality measurement, there are four distinct sub-topics, each corresponding to one of the papers that will be produced during the Ph.D.: (i) consolidation, formalization, and assessment of the existing approaches, (ii) extension of the existing approaches by incorporating measurement error, (iii) extension of the existing approaches by incorporating additional sources of error, and (iv) testing of the extended framework on Big Data and administrative data.
The greatest strength of the proposed research is its scientific relevance, as it can substantially improve the quality of many current scientific endeavors. A substantial number of research projects now use Big Data and administrative data, given the rising costs, the cumbersome collection process, and the non-response that pervade survey data. However, researchers often do not examine these new data sources and their retrieval process thoroughly. For example, several papers limit their data section to mentioning that they retrieved 9 million tweets from the Twitter app in a given time window, omitting any details on the data cleaning and storage process. Overlooking the measurement details and the sampling mechanism can lead to biased results, which support few, if any, valid conclusions about real-life situations. Thus, evaluating the quality of the new and diverse data sources employed in social science research is a crucial topic for the scientific community.
Supervisors
Dr. Dimitris Pavlopoulos
Dr. Reinoud Stoel
Dr. Arnout van Delden
Dr. Ton de Waal
Financed by
Statistics Netherlands
Period
March 2022 – March 2026