Tilburg School of Social and Behavioral Sciences
– prof.dr. K. Sijtsma (Tilburg University)
– dr. A.A. Béguin (Cito, Arnhem)
On December 12th, 2014, Marie-Anne Mittelhaëuser defended her thesis entitled
Modeling the Effect of Differential Motivation on Linking Educational Tests
Application of mixed IRT models and person-fit methods in educational measurement
Item response theory (IRT) models have specific properties that are useful in educational measurement. These properties support the construction of measurement instruments, linking and equating of measurements, and evaluation of test bias, among other things (Scheerens, Glas, & Thomas, 2007). However, these properties are only useful if the IRT model fits the data and if the proficiency level and item parameters are accurately estimated. Unfortunately, for various reasons, these conditions are not always met. For example, if groups of respondents display “sleeping” behavior (e.g., inaccurately answering the first items in a test due to problems getting started), “plodding” behavior (e.g., spending too much time on the first items and thereby answering the later items incorrectly because too little time is left), random response behavior (e.g., answering items randomly), or cheating behavior (e.g., copying answers from other examinees), an IRT model might not fit specific subgroups of respondents within the total group (Meijer & Sijtsma, 2001; Meijer, 2003).
Several methods have been proposed to identify these aberrant response behaviors. For example, person-fit methods assign a value to each individual vector of item scores, and a statistical test is used to decide whether the underlying IRT model or other measurement model fits that vector. Significant person-fit values identify item-score vectors that are aberrant relative to the IRT model, and the researcher may decide to remove these vectors from the data set (Meijer & Sijtsma, 1995). This is expected to improve the fit of the IRT model and the accuracy of the parameter estimates. A well-known person-fit statistic is the lz statistic (Drasgow, Levine, & Williams, 1985). Research showed that the normal approximation to lz is invalid, which yields a conservative test, in particular for detecting aberrant responses at the lower and higher ends of the proficiency scale and when applied to short scales (Van Krimpen-Stoop & Meijer, 1999). Fortunately, Snijders (2001) and De la Torre and Deng (2008) developed methods to improve the accuracy of person-fit analysis using lz.
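To make the person-fit idea concrete, the following is a minimal sketch of the lz statistic for dichotomous items. It assumes a Rasch model with known item difficulties and a known proficiency value; the function names, item difficulties, and response patterns are hypothetical and chosen only for illustration. The statistic standardizes the log-likelihood of a response vector by its expected value and variance under the model, so large negative values point to aberrant vectors. The sketch does not include the corrections of Snijders (2001) or De la Torre and Deng (2008), which matter when proficiency is estimated rather than known.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def lz_statistic(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic for 0/1 item scores.

    Undefined (division by zero) if every item difficulty equals theta,
    because the variance term is then zero.
    """
    probs = [rasch_prob(theta, b) for b in difficulties]
    # Observed log-likelihood of the item-score vector.
    l0 = sum(x * math.log(p) + (1 - x) * math.log(1 - p)
             for x, p in zip(responses, probs))
    # Expected value and variance of the log-likelihood under the model.
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)


# Hypothetical five-item test, items ordered from easy to hard.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Easy items correct, hard items incorrect: consistent with the model.
lz_consistent = lz_statistic([1, 1, 1, 0, 0], 0.0, difficulties)

# Easy items incorrect, hard items correct: e.g. "sleeping" behavior.
lz_aberrant = lz_statistic([0, 0, 1, 1, 1], 0.0, difficulties)

print(lz_consistent, lz_aberrant)
```

In this example the reversed pattern receives a much lower lz value than the model-consistent pattern, which is the signal a person-fit analysis would use to flag it.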
Alternatively, mixed IRT models assume that the data are a mixture of different data sets from two or more latent populations (Rost, 1997; Von Davier & Yamamoto, 2004), also called latent classes. If this assumption is correct, a particular IRT model does not hold for the entire population, but different model parameters are valid for different subpopulations. Hence, mixed IRT models may be used to identify classes in the data that display different types of response behavior, and the researcher may decide to remove an entire class from the data set so as to improve IRT model fit and parameter estimates. For example, one can specify the mixed IRT model in such a way that one of the latent classes represents high-stakes response behavior while the other latent class represents low-stakes response behavior (Béguin, 2005; Béguin & Maan, 2007).
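As a rough illustration of how a mixed IRT model separates respondents into classes, the sketch below computes posterior class membership probabilities for a response pattern in a two-class mixed Rasch model. Everything specific is an assumption made for the example: the class-specific item difficulties, the class proportions, the standard normal proficiency distribution in each class, and the simple grid used to integrate over proficiency. In practice these parameters are estimated from the data rather than fixed in advance.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def class_likelihood(responses, difficulties, grid):
    """Marginal likelihood of a 0/1 pattern within one latent class.

    Proficiency is integrated out over a grid, weighted by an
    (assumed) standard normal density.
    """
    total = 0.0
    weight_sum = 0.0
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)
        lik = 1.0
        for x, b in zip(responses, difficulties):
            p = rasch_prob(theta, b)
            lik *= p if x == 1 else 1.0 - p
        total += weight * lik
        weight_sum += weight
    return total / weight_sum


def posterior_class_probs(responses, class_difficulties, class_proportions):
    """Posterior probability of each latent class given a response pattern."""
    grid = [-4.0 + 0.1 * k for k in range(81)]
    joint = [pi * class_likelihood(responses, b, grid)
             for b, pi in zip(class_difficulties, class_proportions)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical parameters: in class 1 the last two items behave as if
# they were much harder, e.g. because effort drops off late in the test.
class_difficulties = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],   # class 0
    [-1.0, -0.5, 0.0, 2.5, 3.0],   # class 1
]
class_proportions = [0.7, 0.3]

post_all = posterior_class_probs([1, 1, 1, 1, 1],
                                 class_difficulties, class_proportions)
post_drop = posterior_class_probs([1, 1, 1, 0, 0],
                                  class_difficulties, class_proportions)

print(post_all, post_drop)
```

A respondent who answers every item correctly is assigned to class 0 with a probability above its prior proportion, while a respondent who misses the last two items shifts toward class 1. Removing or separately modeling a class, as described above, would act on these posterior probabilities.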
The goal of this project is to investigate how mixed IRT models and person-fit methods can be used to improve educational measurement procedures. More specifically, the research focuses on equating and linking procedures in which two high-stakes tests are compared.
This project was financed by Tilburg University and Cito.