Prof. Herbert Hoijtink, Prof. Gunter Maris
On May 13th 2016, Maria Bolsinova defended her thesis entitled
In educational measurement, data obtained using educational tests are gathered for both practical and scientific purposes, for example for individual assessment or to study the effects of educational policies. While these data often have a very complex structure, we try to capture their most important aspects with relatively simple models. The reason is that statistical models are needed to make inferences about the unobservable constructs of interest (e.g., reading ability, foreign language proficiency, and arithmetic ability) on the basis of observed test data. The dissertation presents various contributions to item response theory (IRT) in educational measurement which, in one way or another, search for an optimal balance between simple models and complex reality. The dissertation consists of two parts:
Part I presents contributions to modeling response time and accuracy and Part II presents Bayesian contributions to IRT.
New applications of Rasch models in educational measurement
Project 1: “Unmixing Rasch scales”
One of the most popular IRT models in educational measurement is the Rasch model [RM]. It models the probability of answering an item correctly using only two parameters: one for the item and one for the person. The main advantage of the Rasch model is that it has a sufficient statistic for the person parameters and a sufficient statistic for the item parameters. This is important both for estimating the parameters and for interpreting test results.
However, the RM is often too restrictive to fit the data. First, it assumes unidimensionality of the test. This means that the test measures only one latent trait, which explains the responses of persons to items. Second, all items are assumed to have the same discriminative power. In educational testing practice it is not uncommon for a test to measure more than one ability, or for some of the test items to be more closely related to the latent trait than others.
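The sufficiency property can be illustrated with a small sketch. It assumes the usual logistic form of the Rasch model, in which the probability of a correct response depends on the difference between the person parameter and the item parameter. The function names and item difficulties below are hypothetical and chosen only for illustration.

```python
import math

def rasch_prob(theta, delta):
    """P(correct) for person parameter theta and item difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def pattern_likelihood(pattern, theta, deltas):
    """Likelihood of one person's 0/1 response pattern under the Rasch model."""
    lik = 1.0
    for x, delta in zip(pattern, deltas):
        p = rasch_prob(theta, delta)
        lik *= p if x == 1 else 1.0 - p
    return lik

deltas = [-1.0, 0.0, 1.5]   # hypothetical item difficulties
a = [1, 1, 0]               # two different patterns with the same sum score (2)
b = [0, 1, 1]

# If the sum score is sufficient for theta, the likelihood ratio of two
# patterns with equal sum scores must not depend on theta.
ratios = [
    pattern_likelihood(a, t, deltas) / pattern_likelihood(b, t, deltas)
    for t in (-2.0, 0.0, 2.0)
]
print(ratios)
```

The three ratios coincide: given the sum score, the particular pattern of correct and incorrect responses carries no further information about the person parameter.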
Two existing models – the between-item multidimensional model and the one parameter logistic model [OPLM] – relax the assumptions of the RM without losing its important property of sufficiency of the test score. Both models imply that a test consists of sub-scales of items in which the RM holds. In both approaches, though, it is assumed that the test structure is known and that these sub-scales are pre-specified. In practice this information is not always available. We propose a multi-unidimensional Rasch model which also assumes that a test consists of Rasch sub-scales, but in which the scale memberships of the items are treated as parameters that have to be estimated.
A Markov chain Monte Carlo algorithm is introduced for estimation of the model. The algorithm makes it possible to identify the Rasch sub-scales that constitute the test. The performance of the algorithm is evaluated using simulations. Rasch scales are recovered both when they represent separate abilities, as in the between-item multidimensional model, and when they differ only in discriminative power, as in the OPLM.
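To make the simulation setting concrete, the following sketch generates data from a two-scale multi-unidimensional Rasch model. It shows only the data-generating side and is not the estimation algorithm from the project. The function `simulate_responses`, the item difficulties, the scale memberships, and the distribution settings are all assumptions made for this example.

```python
import math
import random

def simulate_responses(n_persons, deltas, scale_of_item, sds, rho, seed=0):
    """Simulate 0/1 responses; each item belongs to one of two Rasch scales.

    sds: standard deviations of the two person parameters (means are zero).
    rho: correlation between the two person parameters.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        # Draw a correlated bivariate normal pair from two independent draws.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta = [
            sds[0] * z1,
            sds[1] * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2),
        ]
        row = []
        for delta, s in zip(deltas, scale_of_item):
            # The item responds only to the person parameter of its own scale.
            p = 1.0 / (1.0 + math.exp(-(theta[s] - delta)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

deltas = [-1.0, 0.0, 1.0, -0.5, 0.5, 1.5]   # hypothetical difficulties
scale_of_item = [0, 0, 0, 1, 1, 1]          # memberships: unknown in practice
data = simulate_responses(200, deltas, scale_of_item, sds=[1.0, 1.5], rho=0.5)
print(len(data), len(data[0]))
```

In a recovery study of this kind, the memberships in `scale_of_item` would be hidden from the estimation procedure and compared with its output afterwards.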
Project 2: “Hypothesis testing based on the unmixed Rasch scales”
In the multi-unidimensional Rasch model introduced in Project 1, the person parameters are assumed to have a multivariate normal distribution. The variance-covariance matrix of this distribution specifies the relations between the person parameters and can be used to distinguish three types of models. In the unconstrained model, the variances of the separate person parameters are different and the correlations between them are also different. In this model the person parameters can be interpreted as different abilities. We can also put constraints on the relations between the person parameters and set all correlations between them to 1. In this model the person parameters (thetas) associated with each dimension are the same up to a difference in scaling, and the standard deviations of the person parameter distributions have the same interpretation as the discrimination indices in the OPLM. Finally, we can constrain the variances of each dimension to be the same, which yields the Rasch model.
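The three model types can be sketched as three variance-covariance matrices for a test with two scales. The helper `covariance_matrix` and all numerical values are hypothetical. The sketch also assumes that the Rasch case keeps the correlations fixed at 1 while additionally equating the variances.

```python
def covariance_matrix(sds, corr):
    """Build a covariance matrix from standard deviations and correlations."""
    k = len(sds)
    return [[sds[i] * sds[j] * corr[i][j] for j in range(k)] for i in range(k)]

perfect = [[1.0, 1.0], [1.0, 1.0]]   # all correlations fixed to 1

# Unconstrained: free variances and correlations (separate abilities).
unconstrained = covariance_matrix([1.0, 2.0], [[1.0, 0.4], [0.4, 1.0]])

# Correlations fixed to 1, variances free: the same ability on different
# scales, with the standard deviations acting like OPLM discrimination indices.
oplm_like = covariance_matrix([1.0, 2.0], perfect)

# Correlations fixed to 1 and equal variances: a single Rasch scale.
rasch = covariance_matrix([1.0, 1.0], perfect)

for name, m in [("unconstrained", unconstrained),
                ("OPLM-like", oplm_like),
                ("Rasch", rasch)]:
    print(name, m)
```

With correlations of 1 and zero means, the second person parameter is the first multiplied by the ratio of the standard deviations, which is why the standard deviations play the role of discrimination indices.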
In the second project a test will be developed that can be used to determine which model is most appropriate for a data set of interest.
Project 3: “Rasch models for test equating using prior knowledge”
Imagine that a test consisting of 40 items is presented to persons taking an exam in the year 2010 (the reference exam). Imagine also a test consisting of 40 new items that is presented to persons taking an exam in the year 2011 (the current exam). The main goal of test equating is to determine a pass/fail criterion such that the ability of persons just passing the exam in 2011 is equal to the ability of persons just passing the exam in 2010.
In order to equate both tests, there has to be a so-called linking group of persons that responds to some of the items from the 2010 exam and some of the items from the 2011 exam. Using the data from the reference group, the linking group, and the current group, and assuming that responses to the 40 items from 2010 and the 40 items from 2011 can be modeled using the Rasch model, both tests can be equated. This equating procedure accounts for the fact that the reference and current exams may not be of the same difficulty and the fact that the reference and current populations may not be of the same ability.
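One way to picture the equating step is the sketch below. It assumes that the item difficulties of both exams have already been estimated on a common scale (which is what the linking group makes possible), and it uses expected test scores to translate a cut score from one exam to the other. This is a generic illustration rather than the procedure developed in the project; the function names, the five-item exams, and the difficulty values are hypothetical.

```python
import math

def expected_score(theta, deltas):
    """Expected number of correct answers at ability theta (Rasch model)."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d))) for d in deltas)

def ability_at_score(score, deltas, lo=-10.0, hi=10.0, tol=1e-10):
    """Find the ability whose expected score equals `score` by bisection.

    The expected score is strictly increasing in theta, so bisection works.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, deltas) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties on a common scale; the current exam is harder.
ref_deltas = [-1.0, -0.5, 0.0, 0.5, 1.0]
cur_deltas = [d + 0.3 for d in ref_deltas]

ref_cut = 3.0                                    # pass mark on the reference exam
theta_cut = ability_at_score(ref_cut, ref_deltas)
cur_cut = expected_score(theta_cut, cur_deltas)  # equivalent mark, current exam
print(theta_cut, cur_cut)
```

Because the current exam is harder in this example, the equivalent pass mark comes out lower than the reference pass mark, while the ability required to just pass stays the same.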
However, there is a major weak point in test equating using the Rasch model: often the linking group is small and the number of items it responds to is also small. This implies that the link between both exams is weak, and that the credibility interval around the resulting estimate of the norm score is rather wide. Project 3 will show that test equating using prior knowledge may be an important step towards a solution of this problem.