MTO, Tilburg School of Social and Behavioral Sciences

Tilburg University

**Supervisor**
Prof. J.K. Vermunt

On March 2nd, 2018, Davide Vidotto will defend his thesis entitled

**Summary thesis**

This dissertation investigates the use of *latent class* (or *mixture*) models for Multiple Imputation (MI). MI is a technique that enables the retrieval of parameter estimates and the performance of statistical inference in the presence of missing data in a dataset. While missing data may represent an issue for standard statistical analysis (e.g., they can introduce bias and loss of power in the final or substantive analysis), MI seeks to fix the problem by replacing the missing data with plausible imputed data, predicted by means of an imputation model. Repeating the replacements (or imputations) several times allows the uncertainty of the imputed values to be taken into account, and leads to valid inferences.
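The pooling step that follows the repeated imputations can be sketched as below. This is a minimal illustration assuming the standard combining rules (Rubin's rules), which the summary does not name explicitly; the function name `pool_rubin` and the example numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors (Rubin's rules).

    estimates : one parameter estimate per imputed dataset
    variances : the squared standard error of each estimate
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                   # pooled point estimate
    within = u.mean()                  # average within-imputation variance
    between = q.var(ddof=1)            # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical estimates of one regression coefficient from M = 3 imputed datasets.
q_bar, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the uncertainty about the imputed values into the final standard error; with a single imputation it could not be estimated.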

In this context, the choice of the imputation model is crucial: it should not only preserve all the relationships needed for a specific analysis of interest (e.g., the main effects of a regression reflect the relationships between the outcome and the predictors), but also reflect the overall relationships present in the data, so that further analyses involving other (more complex) kinds of associations can be carried out (e.g., interaction terms represent the simultaneous relationship between two predictors and the outcome). Thus, in MI we are interested in the predictions produced by the imputation model, and in how they reflect relationships among variables, rather than in interpreting its parameter values. The broader the imputation model, the better it can capture important relationships in the data. As a consequence, overfitting the data with the imputation model is of less concern than underfitting: while an underfitting model might ignore important relationships in the data, an overfitting one takes into account all relevant relationships, as well as sample-specific fluctuations. As a result, in the former case the model could produce poor imputations, while in the latter case the relevant relationships are preserved.

The thesis deals in particular with the MI of missing categorical data; while methods for continuous data have been extensively explored, MI models for categorical data are lacking in the literature. With categorical data, the focus is on retrieving relevant associations in the joint distribution of the categorical variables of a dataset. The saturated log-linear model, which takes into account all theoretically possible associations in the data, is a typical choice in this context. However, saturated log-linear models are computationally feasible only with a small number of items. As a solution, recent proposals for the MI of categorical data use either latent class analysis (frequentist framework) or the Dirichlet Process Mixture of Multinomial Distributions (Bayesian framework) as imputation models, both of which belong to the family of mixture models. Unlike MI via saturated log-linear models, MI through latent class models can be performed on datasets containing a large number of variables thanks to the *local independence* assumption, under which the variables are independent of each other once their distribution is conditioned on the latent classes.
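How local independence keeps imputation tractable can be sketched for a single record. The example assumes the class weights `pi` and the conditional response probabilities `theta` are already given; in a proper MI procedure they would themselves be estimated or drawn from their posterior. The function name and the numbers are hypothetical.

```python
import numpy as np

def impute_record(y, pi, theta, rng):
    """Impute the missing items of one record under a latent class model.

    y     : list of category codes, with None marking a missing item
    pi    : (K,) array of class weights
    theta : list of (K, C_j) arrays, theta[j][k, c] = P(item j = c | class k)
    """
    # Under local independence, the class posterior is a product over the
    # observed items only; missing items simply drop out of the product.
    log_post = np.log(pi)
    for j, value in enumerate(y):
        if value is not None:
            log_post = log_post + np.log(theta[j][:, value])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Draw a class, then draw each missing item independently given that class.
    k = rng.choice(len(pi), p=post)
    completed = [
        value if value is not None
        else int(rng.choice(theta[j].shape[1], p=theta[j][k]))
        for j, value in enumerate(y)
    ]
    return completed, post

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
theta = [np.array([[0.9, 0.1], [0.1, 0.9]]),
         np.array([[0.8, 0.2], [0.3, 0.7]])]
completed, post = impute_record([0, None], pi, theta, rng)
```

The cost grows linearly with the number of items, whereas a saturated log-linear model needs one parameter per cell of the full contingency table.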

In order to reflect all the necessary variability for the imputations, the imputation model should be tailored to the design used to collect and analyze the data. For instance, cross-sectional data need a model that takes all relevant associations among items into account; with multilevel data, in which several lower-level units are nested within higher-level units (such as students nested within schools), correlations and dependencies arising among units of the same group must also be accounted for; with longitudinal data, variables are observed over time for the same units, and auto-correlations and lagged relationships are likely to arise. Ignoring these aspects of the data may lead to underfitting and, as a consequence, to biased (and/or too stable) post-imputation inferences.

The purpose of this thesis is to propose and investigate different types of latent class models for the MI of categorical data; each of these types of models is tailored to the design chosen for the data collection and analysis. Thus, Chapter 2 of the thesis offered a review of the latent class models present in the literature for the MI of cross-sectional categorical data. Chapter 3 investigated in detail the behavior of Bayesian latent class models for the MI of cross-sectional data. Chapter 4 examined the behavior of Multilevel latent class models for the MI of multilevel data. Lastly, Chapter 5 assessed the performance of the Mixture latent Markov model for the imputation of longitudinal data. Simulation and empirical studies reported in the chapters show good behavior of the imputation models under analysis, in terms of bias and coverage rates of the substantive models.

The imputation models presented in the thesis have been developed under a Bayesian framework and estimated by means of the Gibbs sampler. Bayesian analysis is well-suited for MI, since it automatically accounts for the variability caused by both the missing data distribution and the parameter uncertainty.

Another purpose of the thesis was to find a way to perform model selection that is suitable for MI. With mixture models, model selection is equivalent to determining the number of components (or classes) to be used at the imputation stage. To achieve this, we exploited a feature of the Gibbs sampler run in combination with mixture models: with a preliminary run of the sampler (and with a particular setting of the prior distribution of the mixture components), it is possible to obtain a (posterior) distribution of the number of classes actually occupied by the data. As a general approach, we chose the maximum of this distribution in order to perform the imputations, so as to use the broadest possible imputation model.
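The preliminary run can be sketched as follows for a simple mixture of multinomials. This is an illustration, not the thesis implementation: it assumes that the "particular setting" of the prior is a small symmetric Dirichlet hyperparameter (`alpha`) on the class weights, which the summary does not state, and the function name, defaults, and simulated data are all hypothetical.

```python
import numpy as np

def occupied_classes(data, n_levels, K=10, alpha=0.1, n_iter=200, seed=0):
    """Gibbs sampler for a mixture of multinomials; returns the number of
    occupied classes at each iteration.

    data     : (n, J) integer array, categories coded 0..C_j-1, -1 = missing
    n_levels : number of categories C_j of each variable
    K        : upper bound on the number of classes
    alpha    : symmetric Dirichlet hyperparameter for the class weights (assumed)
    """
    rng = np.random.default_rng(seed)
    n, J = data.shape
    z = rng.integers(K, size=n)              # random initial allocations
    occupied = np.empty(n_iter, dtype=int)

    for it in range(n_iter):
        # Class weights given the current allocations.
        pi = rng.dirichlet(alpha + np.bincount(z, minlength=K))

        # Conditional response probabilities, and the log-likelihood of each
        # record under each class, using the observed items only.
        log_lik = np.zeros((n, K))
        for j in range(J):
            obs = data[:, j] >= 0
            table = np.zeros((K, n_levels[j]))
            np.add.at(table, (z[obs], data[obs, j]), 1)
            theta_j = np.vstack([rng.dirichlet(1.0 + row) for row in table])
            log_lik[obs] += np.log(np.clip(theta_j[:, data[obs, j]].T, 1e-300, None))

        # New allocations, sampled with the Gumbel-max trick.
        log_post = np.log(np.clip(pi, 1e-300, None)) + log_lik
        z = np.argmax(log_post + rng.gumbel(size=(n, K)), axis=1)
        occupied[it] = np.unique(z).size

    return occupied

# Simulated two-class data with five binary items and 10% missing values.
rng = np.random.default_rng(1)
true_class = rng.integers(2, size=300)
probs = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(category | class), same for every item
data = np.column_stack([(rng.random(300) < probs[true_class, 1]).astype(int)
                        for _ in range(5)])
data[rng.random(data.shape) < 0.1] = -1

occupied = occupied_classes(data, n_levels=[2] * 5, K=8, n_iter=100)
burn_in = 50
distribution = np.bincount(occupied[burn_in:], minlength=9) / (100 - burn_in)
n_classes = occupied[burn_in:].max()         # broadest model visited by the sampler
```

Taking the maximum rather than the mode follows the preference stated above for overfitting over underfitting at the imputation stage.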

Several extensions of the models proposed in this dissertation are possible. The main one concerns the measurement scale of the variables assumed by the models: while categorical scales are frequently used in questionnaires in the social and behavioral sciences, variables measured on mixed types of scales (i.e., continuous and categorical) are common in other contexts. The mixture models described above can be easily modified to accommodate both kinds of measurement scales (e.g., by assuming mixtures of Normal and Multinomial distributions), but their performance must be evaluated in future research. Multilevel latent class models can also be adjusted to account for more than two levels in the hierarchy, while mixture latent Markov models can be extended to include second- or higher-order lagged relationships.

**Project**

**Multiple imputation of nested missing data using extended latent class models**

Social science researchers often make use of multilevel and longitudinal data sets, in which the occurrence of missing data is a well-known problem. To prevent biased results, it is important to deal with missing data in an appropriate manner. This project develops new multiple imputation methods for such nested missing data using extended latent class models. These models can capture dependencies between observations within groups (within individuals in the longitudinal case) and deal with (possibly missing) variables measured at different hierarchical levels. The new imputation methods will be implemented in a freely available R package.

**Financed by**
NWO (Research Talent)