Florian van Leeuwen

Methodology and Statistics
Social and Behavioural Sciences
Utrecht University


Project
Getting the best predictions from complex data sets: Balancing scalability and interpretability

Recent technological advances have revolutionized data collection in fields such as healthcare and psychology. Genome sequencing, brain-activity scanning, and experience sampling methods now provide researchers with extensive information on many variables across numerous timepoints. Furthermore, the growing emphasis on open science has made high-dimensional data sets more accessible through online platforms. This wealth of information presents researchers with the opportunity to address increasingly complex research questions. However, existing data analysis methods are inadequate: traditional statistical approaches struggle to extract meaningful patterns from noisy, high-dimensional data, while novel machine learning techniques offer improved prediction accuracy but lack interpretability.

Traditional statistical methods have historically focused on explanatory modeling: developing and testing statistical models based on theory in order to, ultimately, draw causal conclusions (Shmueli, 2010). However, as the dimensionality of the data increases, it becomes increasingly challenging to establish causal relationships, because comprehensive theoretical knowledge is lacking for all variables in the data set. Consequently, many researchers have shifted their focus towards prediction as the primary objective of analysis.

A widely used and effective statistical approach to accurate prediction is penalized regression. These techniques balance the bias and variance of the estimated parameters by pulling small effects towards zero while keeping large effects close to their original magnitude. Two popular forms of penalization are the ridge penalty (Hoerl and Kennard, 1970) and the least absolute shrinkage and selection operator (lasso; Tibshirani, 1996). Bayesian penalized regression achieves similar shrinkage while providing several added advantages, including automatic uncertainty estimation, simultaneous estimation of the penalty parameter, and more flexible shrinkage behaviors (van Erp, Oberski, and Mulder, 2019). This approach thus offers a more comprehensive and robust analysis framework.
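As a minimal illustration of this shrinkage behavior, the sketch below fits ordinary least squares, ridge, and lasso to simulated data with two large effects, a few small ones, and the rest zero. It uses scikit-learn; the simulated coefficients and penalty values are arbitrary choices for illustration, not part of the project.

```python
# Illustrative sketch: ridge and lasso pull small effects towards zero
# while largely preserving large effects. Simulated data; all values
# below are arbitrary choices for demonstration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [5.0, -4.0]          # two large effects
true_beta[2:5] = [0.3, -0.2, 0.25]   # a few small effects; the rest are zero
y = X @ true_beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.2).fit(X, y)   # L1 penalty: sets small ones exactly to zero

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
```

The lasso output shows the selection behavior directly: coefficients whose true value is zero are estimated as exactly zero, while the two large effects remain close to their true magnitudes.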

Machine learning techniques are thought to further improve prediction accuracy. Their popularity has surged in recent years, with the share of social science abstracts containing machine learning-related keywords increasing roughly four-fold (from 0.63% to 2.34%; Rahal et al., 2022). However, this exclusive focus on prediction has a significant drawback: the resulting models are often complex and difficult to interpret. Consequently, policymakers and decision-makers are skeptical of machine learning results, and researchers face challenges in performing inference and gaining new knowledge from these models.

For example, adversarial models, particularly the Generative Adversarial Networks (GANs) introduced by Goodfellow et al. (2014), have heralded a paradigm shift in the modeling of complex data distributions. These models consist of two neural networks, a generator and a discriminator, that are trained together: the generator tries to create data that is indistinguishable from real data, while the discriminator tries to tell the two apart. Through this adversarial process, the generator becomes increasingly adept at producing data that mimics the real distribution and, by proxy, at modeling the observed data distribution. GANs have shown impressive capability in capturing intricate patterns in high-dimensional spaces, making them apt for datasets where traditional methods may falter (Arjovsky et al., 2017). Yet while the GAN architecture leads the domain of generative modeling, its application to tabular prediction settings with mixed data types remains largely unexplored. GANs also have well-known challenges, such as extensive data requirements, loss of explainability, lengthy computation times, and a need for intricate tuning. Furthermore, GANs currently lack easy-to-perform inferences for tasks like marginalization and conditioning, which are crucial for probabilistic reasoning.

To this end, Adversarial Random Forests (ARFs; Watson, 2023) merge traditional machine learning techniques with deep learning methodologies to solve many of these challenges. By combining adversarial training with random forests, ARFs can effectively handle mixed data in tabular setups while remaining efficient on large data sets. ARFs can outperform deep learning models such as Variational Autoencoders and GANs on certain, though not all, metrics, with execution speeds up to 100 times faster (Watson, 2023). Most importantly, ARFs can easily be transformed to allow for probabilistic inference. However, while individual trees are interpretable, an ARF consists of many trees, which makes it challenging to obtain a cohesive, singular explanation of the model's behavior.
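The first adversarial round of an ARF can be sketched as follows: each column of the real data is permuted independently, which preserves the marginal distributions but destroys the joint dependencies, and a random forest is then trained to discriminate real from synthetic rows. The sketch below illustrates this idea on simulated data using scikit-learn; it is a simplified illustration, not the reference implementation, and all variable names are hypothetical.

```python
# Simplified sketch of one adversarial round of an ARF:
# a random forest learns to separate real rows from column-permuted
# ("naive synthetic") rows, revealing the joint dependency structure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Real data with a strong dependency between the two columns.
n = 1000
x1 = rng.normal(size=n)
real = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=n)])

# Naive synthetic data: permute each column independently. Marginals
# are preserved, but the joint dependency is destroyed.
synthetic = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

X = np.vstack([real, synthetic])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = real, 0 = synthetic

# Out-of-bag accuracy well above 0.5 means the forest has detected
# the dependency that the permutation destroyed.
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X, y)
print(f"discriminator OOB accuracy: {clf.oob_score_:.2f}")
```

In the full algorithm, new synthetic data would then be drawn from the fitted forest's leaves and the procedure iterated until the discriminator can no longer beat chance accuracy, at which point the forest's leaves encode the data distribution.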

Researchers now face a dilemma: they can either rely on traditional statistical methods, which offer explanation and inference but provide suboptimal predictions, or embrace novel machine learning methods for optimal predictions but sacrifice interpretability, inference, or both. The objective of this project is to bridge this gap by addressing the following questions:

  1. How can traditional statistical methods be optimized to provide the best predictions in high-dimensional datasets?
  2. How can we effectively apply state-of-the-art machine learning methods to realistic datasets, and understand the resulting models?
  3. How can we incorporate prior knowledge to enhance both predictive accuracy and interpretability?
  4. How do the resulting methods compare to each other in terms of their performance and utility?

By answering these questions, this project aims to elucidate the potential and limitations of traditional statistical methods and novel machine learning techniques for addressing research questions based on complex data sets. As a result, we will provide applied researchers with clear guidance on how to extract the most information from the wealth of data available to them.

Topics and open questions that fall within the scope of the project include, but are not limited to, the research questions listed above.

Supervisors
Prof. Dr. Ellen Hamaker
Dr. Sara van Erp
Dr. Gerko Vink

Financed by
Utrecht University

Period
1 June 2024 – 1 June 2028