Prof. Rob Meijer & Dr. Jorge Tendeiro
Practical Implications of the Misfit of Item Response Theory Models
In education, psychology, and health research, item response theory (IRT) is increasingly used to construct tests and to evaluate the psychometric quality of existing tests. Due to the increasing availability of easy-to-use software, not only test development companies but also researchers use IRT to develop and evaluate tests and questionnaires. Before applying an IRT model, researchers should report fit measures to show that the model describes the data reasonably well, so that, for example, estimated theta levels can be trusted. However, IRT models and their underlying assumptions represent ideals about data that do not exist in practice. As Funder (1997) discussed in a more general context: “There are only two kinds of data: Terrible data that are ambiguous, potentially misleading, incomplete, and imprecise. The second kind is No Data”. Because IRT models never fit the data exactly, a researcher who applies them is left uncertain about which model to choose and about whether the choice of model makes a meaningful difference for the practical decisions based on it. Additional parameters can be added to a model, yielding more complex models that may fit better but require more complex parameter estimation, and these extra parameters may be less stable under replication. Should a researcher therefore prefer the more complex models, or does the poorer fit of simpler models have only a minor influence on the practical decisions that are made?
What researchers and practitioners badly need is evidence about the stability of the main conclusions of empirical educational research in which IRT models are being used (Molenaar, 1997). Some of the research problems that this project tries to address may be summarized as follows:
- Is there a difference in the main conclusions derived from an instrument (e.g., a test or a questionnaire) with or without badly fitting items?
- Is there a difference in the main conclusions derived from an instrument with or without misfitting item score patterns?
- If there are differences: How large and how consequential are they?
In this project we will investigate the practical significance of misfit of IRT models, that is, the extent to which the decisions made from test scores are robust against the misfit of the IRT models (Sinharay & Haberman, 2014). The main aim of this project is to investigate, through both simulated and empirical data, whether the main conclusions in empirical research hold under different IRT models and under different violations of IRT models.
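The kind of simulation study described above can be sketched in a few lines. The toy example below (numpy only; all sample sizes and parameter values are illustrative choices, not taken from the project) generates item responses under a two-parameter logistic (2PL) model, then scores the same examinees twice: once with the true 2PL item parameters and once with a deliberately misspecified one-parameter (Rasch-like) model that fixes all discriminations at 1. Comparing the rank order of the two sets of theta estimates gives a rough sense of how much the simpler, worse-fitting model would change decisions that depend only on the ordering of examinees.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: 500 examinees, 20 items, responses generated
# under a 2PL model with varying discriminations.
n_persons, n_items = 500, 20
theta = rng.normal(0.0, 1.0, n_persons)      # true abilities
a = rng.uniform(0.5, 2.0, n_items)           # true discriminations
b = rng.normal(0.0, 1.0, n_items)            # true difficulties

p_true = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
x = (rng.uniform(size=p_true.shape) < p_true).astype(int)  # 0/1 responses

def ml_theta(x, a, b, grid=np.linspace(-4, 4, 161)):
    """Maximum-likelihood theta estimate on a grid, given item parameters."""
    # P(correct) at each grid point for each item: shape (G, I)
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    # Log-likelihood of each person's pattern at each grid point: (N, G)
    ll = x @ np.log(p).T + (1 - x) @ np.log(1.0 - p).T
    return grid[ll.argmax(axis=1)]

theta_2pl = ml_theta(x, a, b)                  # scored with the true model
theta_1pl = ml_theta(x, np.ones(n_items), b)   # scored with a misspecified 1PL

def rank(v):
    """Rank transform (ties broken arbitrarily; fine for a sketch)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

# Rank correlation between the two sets of ability estimates.
rho = np.corrcoef(rank(theta_2pl), rank(theta_1pl))[0, 1]
print(f"rank correlation of 1PL vs 2PL theta estimates: {rho:.3f}")
```

In a full study one would go further and check how often decisions (e.g., pass/fail classifications at a cut score) actually flip between the two models, and repeat the comparison under different kinds and degrees of model violation.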
University of Groningen
1 September 2015 – 1 September 2019