Sebastián Mildiner Moraga

Methodology and Statistics
Faculty of Social and Behavioural Sciences
Utrecht University

Academic webpage Sebastián Mildiner Moraga

Project

The multilevel explicit-duration hidden Markov model for real time behavioural data

Technological advances have made it increasingly easy in the social sciences to collect data on behaviour as it unfolds in real time over a prolonged period. Take, for instance, the interaction between a therapist and a patient: from a video recording, we can automatically annotate different types of nonverbal communication every second for a period of 15 minutes. Other methods for collecting real-time behavioural data include experience sampling, GPS tracking, and accelerometry. These new data enable a novel perspective on behaviour: studying its dynamics over time, in contrast to the static summaries of behaviour that are currently typical.

To extract the dynamics of behaviour over time, the statistical model of choice is the hidden Markov model (HMM; Rabiner, 1989; Zucchini, Macdonald, & Langrock, 2017). The HMM is a machine learning method that has been used for several decades in many scientific fields, including speech recognition (Rabiner, 1989), human activity recognition (Ronao & Cho, 2017), animal behaviour (Bode & Seitz, 2018), and DNA labelling (Rueda, Rueda, & Diaz-Uriarte, 2013). Within the social sciences, the HMM is still rarely used. When applied to real-time behavioural data, it makes it possible to extract latent (i.e., hidden) behavioural states from one or several dependent variables and to model how behaviour moves between those states over time. The model therefore shows great potential for data collected within the social sciences and for answering new research questions.
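To make the idea of extracting latent states concrete, the following is a minimal sketch of Viterbi decoding for a conventional HMM with categorical observations. It is written in Python for illustration only and is not the project's implementation; the two states, the observation coding (0 = no gesture, 1 = gesture), and all probabilities are hypothetical values chosen for the example.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Return the most likely hidden-state sequence for a discrete HMM.

    All probabilities must be strictly positive because the recursion
    works on the log scale to avoid numerical underflow.
    """
    n_states = len(initial)
    # Log-probability of the best path ending in each state at time 0.
    log_delta = [
        math.log(initial[s]) + math.log(emission[s][observations[0]])
        for s in range(n_states)
    ]
    backpointers = []
    for obs in observations[1:]:
        new_log_delta, pointers = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s.
            best_prev = max(
                range(n_states),
                key=lambda r: log_delta[r] + math.log(transition[r][s]),
            )
            pointers.append(best_prev)
            new_log_delta.append(
                log_delta[best_prev]
                + math.log(transition[best_prev][s])
                + math.log(emission[s][obs])
            )
        log_delta = new_log_delta
        backpointers.append(pointers)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n_states), key=lambda s: log_delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-state example: state 0 rarely emits a gesture, state 1 often does.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.9, 0.1], [0.2, 0.8]]
observed = [0, 0, 0, 1, 1, 1]

decoded = viterbi(observed, initial, transition, emission)
print(decoded)
```

With these illustrative values the decoded sequence switches from the first to the second state at the point where gestures start to appear.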

To make the HMM well suited to real-time behavioural data, the conventional HMM must be extended in two ways. First, the conventional HMM is typically used to analyse only one long sequence of data, such as one string of DNA or one speech sequence. The HMM is therefore extended to the multilevel framework, so that we can model the observed sequences of multiple subjects simultaneously and investigate how the dynamics of behaviour are influenced by covariates (e.g., de Haan-Rietdijk et al., 2017; Shirley, Small, Lynch, Maisto, & Oslin, 2012; Altman, 2007). Second, the durations of the latent behavioural states need to be modelled explicitly and allowed to deviate from a geometric distribution, by using an explicit-duration HMM (ED-HMM; Guédon, 2003). The conventional HMM implicitly assumes that a shorter duration of a (behavioural) state is always more probable than a longer duration, which is a poor match for behavioural data. The ED-HMM within the multilevel framework (in which all model parameters are allowed to be random) is a novel method that has not yet been described in the literature, and extensive preliminary results show that it is viable (Aarts, Dolan, & Van Der Sluis, 2016).
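The duration assumption can be illustrated numerically. In a conventional HMM, a state with self-transition probability p is occupied for exactly d time points with probability p^(d-1) (1 - p), which decreases in d. The sketch below, in Python for illustration only, compares this implied geometric distribution with one possible explicit duration distribution. The shifted Poisson used here and all numeric values are assumptions made for the example, not choices taken from the project.

```python
import math


def geometric_duration_pmf(self_transition, max_duration):
    """Dwell-time probabilities implied by a conventional HMM state
    with the given self-transition probability, for d = 1..max_duration."""
    return [
        self_transition ** (d - 1) * (1 - self_transition)
        for d in range(1, max_duration + 1)
    ]


def shifted_poisson_duration_pmf(mean_extra, max_duration):
    """An illustrative explicit duration distribution: d - 1 ~ Poisson(mean_extra)."""
    return [
        math.exp(-mean_extra) * mean_extra ** (d - 1) / math.factorial(d - 1)
        for d in range(1, max_duration + 1)
    ]


# Both distributions have an expected duration of 10 time points.
geometric = geometric_duration_pmf(0.9, 30)
poisson = shifted_poisson_duration_pmf(9.0, 30)

# Most probable duration under each distribution (durations start at 1).
geometric_mode = 1 + geometric.index(max(geometric))
poisson_mode = 1 + poisson.index(max(poisson))

print("most probable duration, geometric:", geometric_mode)
print("most probable duration, shifted Poisson:", poisson_mode)
```

Although both distributions have the same expected duration, the geometric one always puts the most mass on a duration of a single time point, whereas the explicit distribution can peak at a longer, more behaviourally plausible duration.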

Research problem

Preliminary work by Aarts et al. showed the viability of the multilevel ED-HMM. Yet its current implementation is computationally very demanding, which makes the model unfit for regular use. To bring run times down to a level that is acceptable for users, the estimation core of the algorithm will have to be reworked. Once this issue is solved, an R package will be made available for applying the multilevel ED-HMM. Extensive simulation studies will provide a better understanding of the capabilities and limitations of the model. These results will also lead to user-friendly guidelines covering aspects such as sample size, statistical power, and the inclusion of covariates in the model.

Goals

1. Improve the algorithm for estimating the multilevel ED-HMM to reduce computational intensity while maintaining robust and unbiased estimation performance.

2. Develop a user-friendly, open-source software package so that applied researchers can use the developed statistical method.

3. Investigate how many subjects' observational sequences should be collected, and how long these sequences should be, when applying the multilevel ED-HMM.

4. Investigate how the required sample size of the ED-HMM depends on the complexity (e.g., number of hidden states, number of dependent variables) of the data.

5. Compare the multilevel ED-HMM and the multilevel HMM on their modelling capabilities.

Methods

The multilevel ED-HMM is implemented in the statistical environment R (and partly in C++). Regarding the specific goals:

  • (1) A literature study will be conducted to identify the most promising options for improving the algorithm for estimating the multilevel ED-HMM. A small number of these options will be implemented and tested using simulation studies.
  • (2) An official R package will be developed, including an extensive tutorial and workshop.
  • (3, 4 & 5) Simulation studies will be conducted.
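
As an illustration of the data-generating step of such a simulation study, the sketch below simulates sequences for several subjects from a two-state explicit-duration model with subject-specific parameters. It is written in Python for illustration only and is not the project's R/C++ code. The shifted Poisson durations, Gaussian emissions, uniform choice of the next state, the single between-subject standard deviation, and all numeric values are assumptions made for the example; the function names are hypothetical.

```python
import math
import random


def draw_poisson(rng, rate):
    """Draw a Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_subject(rng, n_observations, mean_extra_durations, emission_means):
    """Simulate one subject's hidden states and observations."""
    n_states = len(mean_extra_durations)
    state = rng.randrange(n_states)
    states, observations = [], []
    while len(states) < n_observations:
        # Explicit duration: stay in the state for 1 + Poisson time points.
        duration = 1 + draw_poisson(rng, mean_extra_durations[state])
        for _ in range(duration):
            states.append(state)
            observations.append(rng.gauss(emission_means[state], 1.0))
        # No self-transitions: move to a different state, chosen uniformly.
        state = rng.choice([s for s in range(n_states) if s != state])
    # The final state visit is cut off at the end of the observation window.
    return states[:n_observations], observations[:n_observations]


def simulate_sample(seed, n_subjects, n_observations,
                    group_mean_extra_durations, group_emission_means,
                    between_subject_sd):
    """Simulate a sample in which each subject has its own parameters,
    drawn around the group-level values (the multilevel part)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_subjects):
        # Random effects: log-normal for durations, normal for emission means.
        subject_durations = [
            m * math.exp(rng.gauss(0.0, between_subject_sd))
            for m in group_mean_extra_durations
        ]
        subject_emissions = [
            m + rng.gauss(0.0, between_subject_sd)
            for m in group_emission_means
        ]
        sample.append(
            simulate_subject(rng, n_observations, subject_durations, subject_emissions)
        )
    return sample


# 20 hypothetical subjects, each observed every second for 15 minutes (900 time points).
sample = simulate_sample(
    seed=1,
    n_subjects=20,
    n_observations=900,
    group_mean_extra_durations=[9.0, 19.0],
    group_emission_means=[0.0, 3.0],
    between_subject_sd=0.2,
)
print(len(sample), len(sample[0][0]))
```

In a simulation study of this kind, the number of subjects and the sequence length would be varied across conditions, and the model would be fitted to each simulated sample to assess how well the known parameters are recovered.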

Supervisors
Prof. dr. I.G. Klugkist, dr. E. Aarts

Financed by
Utrecht University

Period

1 September 2010 – 1 August 2023