Giorgio Spadaccini

Department of Methodology and Statistics
Faculty of Social Sciences
Leiden University

Project
Enhancing the use and interpretation of tree-based prediction models

Since access to more powerful computers has become more widespread, statistical and machine learning models have been able to reach new levels of complexity and, in doing so, solve more difficult tasks with higher precision. As these models typically aim at maximum predictive accuracy, they often lack both interpretability and inferential tools, which are desirable both on a societal level, for more communicable and transparent research, and on a scientific level, for clearer insight into the relationship between predictors and outcome. Moreover, most statistical models cannot incorporate human concepts such as ethics, legality or justice in the training process. The “Correctional Offender Management Profiling for Alternative Sanctions” (COMPAS) software, for example, used in the state of New York for more than a decade, was shown to produce racially biased estimates of the risk of committing a crime [1][2]. As it is essential to guarantee that statistical estimates are as free as possible from such biases, the interpretability of a model is key to performing the post-hoc checks needed to ensure that a model adheres to human standards and principles.
Prediction Rule Ensembles (PRE, [3][4]) aim to turn tree ensembles into interpretable models by means of LASSO-penalized linear regression, whose strong sparsity increases the transparency and interpretability of the model while retaining most of the ensemble's accuracy and its ability to capture nonlinear and interaction effects. The use of Bayesian estimation will likely improve the predictive performance and stability of RuleFit compared to the frequentist LASSO, thereby improving an interpretable Machine Learning method. Further, it will allow for the development of valid uncertainty quantification. We aim to propagate this uncertainty into Shapley values, enabling inference for explainable Machine Learning.
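As a minimal illustration of the rule-ensemble recipe described above, the following Python sketch (using scikit-learn on simulated data; all names and settings are illustrative and not part of the project) grows a shallow tree ensemble, converts each root-to-leaf path into a binary rule feature, and lets the LASSO retain a sparse subset of those rules:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

# Toy regression data (illustrative only).
X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

# Step 1: grow a shallow tree ensemble to generate candidate rules.
ens = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                learning_rate=0.1, random_state=0).fit(X, y)

def extract_rules(tree):
    """Return each root-to-leaf path as a list of (feature, threshold, '<=' or '>') conditions."""
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:   # leaf: the accumulated conditions form one rule
            if conditions:
                rules.append(conditions)
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], conditions + [(f, thr, "<=")])
        walk(t.children_right[node], conditions + [(f, thr, ">")])

    walk(0, [])
    return rules

all_rules = [r for est in ens.estimators_.ravel() for r in extract_rules(est)]

def rule_matrix(X, rules):
    """Binary matrix whose (i, j) entry is 1 if observation i satisfies rule j."""
    Z = np.ones((X.shape[0], len(rules)))
    for j, rule in enumerate(rules):
        for f, thr, op in rule:
            Z[:, j] *= (X[:, f] <= thr) if op == "<=" else (X[:, f] > thr)
    return Z

Z = rule_matrix(X, all_rules)

# Step 2: the LASSO selects a small subset of rules, yielding a sparse, readable model.
lasso = LassoCV(cv=5, random_state=0).fit(Z, y)
selected = np.flatnonzero(lasso.coef_)
print(f"{len(all_rules)} candidate rules, {len(selected)} retained by the LASSO")

Continuing the same sketch, the lines below indicate how coefficient uncertainty could be propagated into Shapley values for a model that is linear in the selected rule features. The “posterior draws” here are a stand-in (Gaussian noise around the LASSO estimates) for the draws a genuine Bayesian fit would provide, and the closed-form Shapley values rely on a feature-independence assumption:

rng = np.random.default_rng(0)
Zs = Z[:, selected]                                   # rule features retained by the LASSO
beta_hat = lasso.coef_[selected]
# Stand-in for posterior draws of the rule coefficients (a real analysis would use MCMC output).
beta_draws = beta_hat + 0.05 * rng.standard_normal((1000, beta_hat.size))

x_rules = Zs[0]                                       # rules satisfied by a single observation
# For a model linear in (assumed independent) features, the Shapley value of feature j at x
# is beta_j * (x_j - mean(x_j)); applying this per draw yields a distribution per contribution.
phi_draws = beta_draws * (x_rules - Zs.mean(axis=0))
ci = np.percentile(phi_draws, [2.5, 97.5], axis=0)    # 95% interval per rule contribution
print(ci.T[:5])                                       # intervals for the first five selected rules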
The project is well suited for the Interuniversity Graduate School of Psychometrics and Sociometrics. In particular, its relevance can be seen in both science and evidence-based practice, where:
• Scientists want to understand how variables affect the outcome. They want to test and develop their theories, which requires statistical inference.
• Practitioners (policymakers, doctors, patients) want to know why the model made a given prediction, or they need decision rules simple enough to memorize.
• Anti-discrimination laws require the effects of variables such as gender, age and race to be transparent, and data protection regulations enforce individuals’ rights to an explanation of data-driven decisions.
Within this framework, our project aims to produce an interpretable Machine Learning prediction method whose state-of-the-art accuracy will help keep biases out of data-driven decisions made in diagnostic, treatment and educational settings, while also providing inferential tools that allow for better theory testing and development in the psychological sciences.

Supervisors
Dr. M. Fokkema
Prof. M. van de Wiel

Period
1 November 2023 – 31 October 2027