Institute of Psychology

Methodology and Statistics

Faculty of Social and Behavioural Sciences

Leiden University

**Project** *Stepwise estimation approaches of growth mixture models*

i. Research Topic and Theoretical Background

Latent curve modeling (LCM) is a statistical method widely used by social and behavioral scientists. An important assumption of conventional LCM is that all individuals are sampled from a single population (Wang & Bodner, 2007). However, in many applied settings, unobserved subpopulations with different growth trajectories may exist. Growth mixture modeling (GMM) relaxes the single-population assumption and can identify such unobserved subpopulations. Currently, two main approaches are used to estimate the parameters of the GMM and its extensions (e.g., models incorporating covariates and/or distal outcomes): one-step (simultaneous) and stepwise estimators. In this project, we propose a novel two-step GMM approach that separates the estimation of the measurement and structural models while avoiding the bias that is common to naive stepwise estimators. By breaking down the likelihood function into a measurement part and a structural part, the approach can incorporate both predictors and distal outcomes of the growth trajectories using stable and robust stepwise estimation algorithms. A further advantage of the two-step estimator is that the samples used in step one (measurement model) and step two (structural model) can be (partially) different, whereas missing data are more problematic for the simultaneous estimator (Vermunt, 2010; Bakk and Kuha, 2018). Similar estimators have been successfully introduced for latent class (Bakk and Kuha, 2018), latent Markov (Di Mari and Bakk, 2018), and structural equation models (Devlieger et al., 2016), but are not yet available for the more complex GMMs, which combine latent class and structural equation models.
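The separation of the likelihood into a measurement and a structural part can be sketched for the covariate case as follows. The notation is introduced here for illustration only and is not fixed by the proposal: y_i denotes the repeated measures of individual i, z_i the covariates, X_i the latent class, pi_k the class proportions, theta the measurement (growth) parameters, and beta the structural parameters. The sketch assumes that y_i is independent of z_i given the class, which is the no-DIF assumption examined in project 3.

```latex
% Step 1: fit the measurement model alone, ignoring the covariates
(\hat{\theta}, \hat{\pi}) = \arg\max_{\theta,\,\pi}
  \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(y_i \mid X_i = k;\ \theta)

% Step 2: hold \hat{\theta} fixed and maximize over the structural parameters only
\hat{\beta} = \arg\max_{\beta}
  \sum_{i=1}^{n} \log \sum_{k=1}^{K} P(X_i = k \mid z_i;\ \beta)\, f(y_i \mid X_i = k;\ \hat{\theta})
```

Because the second step treats the step-one estimates as known constants, standard errors computed from the step-two likelihood alone do not reflect the uncertainty in those estimates.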

ii. Methodological Approach

The general methodology used in this project consists of three steps:

1) Think and Derive;

2) Simulate and Compare;

3) Apply and Analyze.

The first step, think and derive, requires mathematical derivation of the properties of the estimators; mathematical statistics is used to establish how the methods perform. In the second step, we perform large-scale simulation studies in which we generate data with known properties under a wide range of conditions. This makes it possible to evaluate the strengths and weaknesses of the proposed methods both when the assumptions are fulfilled and when they are not tenable. Moreover, the simulations make it possible to confirm the mathematical derivations and to evaluate the effect of the assumptions, so that we can say clearly whether the derived results hold under the simulated conditions. Lastly, in each chapter a real data application (from openly available datasets) will be used to demonstrate the usability of the approach.
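As an illustration of "data with known properties", the following minimal sketch generates trajectories from a two-class linear growth mixture. The function name `simulate_gmm`, all default parameter values, and the simplification of uncorrelated random intercepts and slopes are assumptions made for this example; they do not describe the simulation design of the project.

```python
import numpy as np

def simulate_gmm(n=500, times=(0, 1, 2, 3), class_probs=(0.6, 0.4),
                 growth_means=((0.0, 0.5), (2.0, -0.3)),
                 growth_sd=(0.5, 0.2), resid_sd=0.5, seed=0):
    """Draw n trajectories from a K-class linear growth mixture (illustrative)."""
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    means = np.asarray(growth_means, dtype=float)   # shape (K, 2): intercept, slope
    # Latent class membership of each individual
    cls = rng.choice(len(class_probs), size=n, p=class_probs)
    # Individual intercept and slope: class mean plus an independent random effect
    eta = means[cls] + rng.normal(size=(n, 2)) * np.asarray(growth_sd)
    # Observed repeated measures: linear growth plus residual noise
    y = eta[:, [0]] + eta[:, [1]] * t + rng.normal(scale=resid_sd, size=(n, len(t)))
    return y, cls

y, cls = simulate_gmm()
# Class-specific mean trajectories, which can be compared with the known truth
for k in np.unique(cls):
    print(k, y[cls == k].mean(axis=0).round(2))
```

Varying arguments such as `n`, `class_probs`, or the distance between the rows of `growth_means` gives conditions with larger or smaller samples and stronger or weaker class separation.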

iii. Outline (sub-projects, e.g., each forming a chapter in the dissertation)

The whole project is divided into four subprojects. The first two introduce the two-step GMM and test its properties in the contexts where the model is most relevant, namely models with covariates (project 1, the most common case in practice) and models with distal outcomes (project 2). In subproject 3 the robustness of the approach to a key underlying model assumption is tested, and in subproject 4 a set of example applications is showcased in a user-friendly manner. All subprojects will lead to peer-reviewed journal articles.

Project 1: Introducing the two-step GMM model with covariates.

Project 1 will introduce the two-step estimator, focusing on models with covariates. The two-step estimator is a pseudolikelihood estimator that was first introduced by Rao (1948) and is common in complex regression models. The approach has already been used successfully in latent class (LC), latent Markov, and structural equation models (SEM). The GMM combines SEM and LC models. In the literature, stepwise estimators are proposed as a go-to approach for complex models where simultaneous estimation is likely to fail. The project focuses on the statistical properties of the two-step estimator: the derivation of the parameter estimates and their standard errors (SEs). While estimating the parameters should be straightforward (the two-step GMM can already be estimated in LatentGOLD or Mplus), the difficulty lies in estimating the standard errors of the step-two model. The naive standard errors ignore the sampling variability of the step-one estimates that are held fixed in the step-two model. In project 1 we will investigate how severe the bias in the naive SEs is and, in parallel, try to derive analytically correct SEs. If analytical SEs cannot be derived, the alternative is to use bootstrap SEs. In short, the main research question (RQ) of project 1 concerns finding the right SE estimator, as parameter estimation is not expected to pose problems. From simpler models we know that the bias in SEs is problematic only with small sample sizes and weak measurement models (Bakk, Oberski, & Vermunt, 2014). The worst-case scenario is therefore that we cannot recommend the approach in those situations, or must state explicitly that the downward bias of the SEs is problematic there. The relevance of the simulation study lies in finding the point at which that bias becomes substantial, so that researchers receive clear recommendations about the situations in which not to use GMM or two-step GMM.
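The bootstrap fallback can be illustrated on a deliberately simplified toy model. In the sketch below, a single normally distributed indicator stands in for the growth model, there are two classes, and one covariate predicts class membership through a logistic model. All function names, starting values, and data-generating values are assumptions for illustration, not the estimator developed in the project. The point of the sketch is that each bootstrap resample repeats both steps, so the resulting SEs include the step-one uncertainty that naive step-two SEs ignore.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm

def step_one(y, n_iter=100):
    """Step 1: EM for a two-class Gaussian mixture, ignoring the covariate."""
    mu = np.quantile(y, [0.25, 0.75])        # crude starts; keeps class 0 the lower one
    sd = np.array([y.std(), y.std()])
    pi1 = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability of class 1
        d0 = (1 - pi1) * norm.pdf(y, mu[0], sd[0])
        d1 = pi1 * norm.pdf(y, mu[1], sd[1])
        w = d1 / (d0 + d1)
        # M-step: weighted proportion, means and standard deviations
        pi1 = w.mean()
        mu = np.array([np.average(y, weights=1 - w), np.average(y, weights=w)])
        sd = np.sqrt(np.array([np.average((y - mu[0]) ** 2, weights=1 - w),
                               np.average((y - mu[1]) ** 2, weights=w)]))
    return mu, sd

def step_two(y, z, mu, sd):
    """Step 2: maximize the likelihood over beta with (mu, sd) held fixed."""
    f0 = norm.pdf(y, mu[0], sd[0])
    f1 = norm.pdf(y, mu[1], sd[1])
    def negloglik(beta):
        p1 = expit(beta[0] + beta[1] * z)    # P(class 1 | z)
        return -np.sum(np.log((1 - p1) * f0 + p1 * f1))
    return minimize(negloglik, x0=np.zeros(2), method="BFGS").x

def two_step(y, z):
    mu, sd = step_one(y)
    return step_two(y, z, mu, sd)

def bootstrap_se(y, z, n_boot=50, seed=0):
    """Nonparametric bootstrap that reruns BOTH steps on every resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample individuals with replacement
        draws[b] = two_step(y[idx], z[idx])
    return draws.std(axis=0, ddof=1)

# Toy data (assumed values): true beta = (-0.5, 1.0), class means 0 and 3
rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)
x = rng.random(n) < expit(-0.5 + 1.0 * z)
y = np.where(x, 3.0, 0.0) + rng.normal(size=n)

beta_hat = two_step(y, z)
se_boot = bootstrap_se(y, z)
print("beta_hat:", beta_hat.round(2), "bootstrap SE:", se_boot.round(3))
```

The sketch relies on well-separated classes so that the class labels keep the same order in every resample; with weak measurement models, label switching across bootstrap samples would need explicit handling.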

Project 2: Two-step GMM models with distal outcomes.

This project extends project 1 to situations with distal outcomes of the GMM. Derivations of the parameter estimates and their standard errors will be provided, alongside a simulation experiment checking the performance of the proposed estimator.

Project 3: Two-step GMM in the presence of differential item functioning.

Differential item functioning (DIF) refers to the situation in which an external variable has a direct effect on an item, for example an item that is easier for boys than for girls. Ignoring DIF is known to bias the parameters of both the structural and the measurement model of GMMs. We will test the performance of residual statistics, such as score tests, and of overall fit measures for the proposed two-step estimator and existing alternatives. In an extended simulation study, we will investigate the bias and efficiency of the two-step estimator when DIF is accounted for, and the consequences of ignoring DIF under the three approaches.

Project 4: An applied step-by-step guide with open code and multiple datasets.

The last project will create a user-friendly tutorial for using the two-step estimator, based on a set of real data examples, that will be submitted to a peer-reviewed journal outlet such as the “Teacher's Corner” of the journal Structural Equation Modeling. The tutorial will show how to implement two-step GMMs in the R package 'flexmix' and in the commercial software packages Mplus and LatentGOLD. It should be mentioned that any statistical approach can perform suboptimally on complex real data when its underlying model assumptions are violated, and this can also be the case with the current estimator. The risk will be mitigated in the first three projects by examining the properties of the estimator in simulated setups that are known to be difficult. Furthermore, in project 4 we can specifically search for real data examples that are messy (for example, with many missing data or a skewed distribution of the distal outcome) to stress test the approach.

**Supervisors** Prof. dr. M.J. de Rooij

Dr. Z. Bakk

Dr. E.M.M. McCormick

**Period**

1 October 2023 – 30 September 2027