Maaike M. van Groen

Cito and RCEC (Twente University)
Psychometric Research Center (POC)
Twente University

Supervisors
prof.dr. T. J. H. M. Eggen (Cito/Twente University)
prof.dr. B.P. Veldkamp (Twente University)

On November 21st 2014, Maaike van Groen defended her thesis entitled

Adaptive testing for making unidimensional and multidimensional classification decisions

Summary
Computerized adaptive tests (CATs) were originally developed to obtain an efficient estimate of the examinee’s ability, but they can also be used to classify the examinee into one of two or more levels (e.g. master/non-master). These computerized classification tests have the advantage that they can also be tailored to the individual student’s ability.

Computerized classification tests require a method that decides whether testing can stop and which decision can be made with the desired confidence. A method to select the items is also required.

In classification testing for unidimensional constructs, items are often selected to measure optimally at either the cutoff point(s) or the student’s current ability estimate. Four methods were developed that combine the efficiency of the first approach with the adaptive item selection of the second approach. Their efficiency and accuracy were investigated using simulations.

Several methods are available to make the classification decisions for constructs modeled with a unidimensional item response theory model. But if the construct is multidimensional, few classification methods are available. A classification method based on Wald’s Sequential Probability Ratio Test was developed for application to CAT with a multidimensional item response theory model in which each item measures multiple abilities. Seitz and Frey’s (2013) method to make classifications per dimension, when each item measures one dimension, was adapted to make classifications on the entire test and on parts of the test. Kingsbury and Weiss’s (1979) popular unidimensional classification method, which uses the confidence interval surrounding the ability estimate, was also adapted for multidimensional decisions. Simulation studies were used to investigate the efficiency and accuracy of the classification methods. Comparisons were made between different item selection methods, between different classification methods and between different settings for the classification methods.

Tests can be used for formative assessment, formative evaluation, summative assessment, and summative evaluation. For seven types of tests, including computerized classification tests and educational games, the design, the possibility of adapting the test, and the possible use for each of these test goals were explored.

Project

Methods for making classification decisions

Most adaptive tests are constructed to estimate the examinee’s ability as efficiently and accurately as possible. Computerized classification testing (CCT) has a different goal: to classify the examinee as efficiently and accurately as possible. A classification decision is made in which the examinee is assigned to one of two or more mutually exclusive categories along the ability scale (Lin & Spray, 2000), using cutting points to separate the categories (Eggen, 1999).

A computerized classification test is of variable length, and examinees “are classified as masters or non-masters as soon as there is enough evidence to make a decision” (Finkelman, 2008). The classification procedure must choose between three options: to stop testing and classify an examinee as a master, to stop testing and classify an examinee as a non-master, or to continue testing and select a new item. Several procedures are available, both for making these decisions and for selecting the items.

The sequential mastery test procedure has to determine whether it is possible to classify an examinee. The examinee is classified using a statistical rule that bases its decision on certain parameters (Thompson, 2009). Thompson (2009) divided the termination methods into three types, each covering a range of comparable statistical procedures for termination decisions. The first type, the sequential probability ratio test (SPRT) procedure, uses hypothesis testing to classify the respondents. The second type, the ability confidence interval (ACI) procedure developed by Kingsbury and Weiss (1979), constructs an interval around the current ability estimate (Thompson, 2009). If the cut point is outside the interval, a classification decision can be made. The third type is based on decision theory: DeGroot (1970) formulated a Bayesian decision theory for making statistical decisions, and Rudner (2002) and Vos (2000) applied decision theory to the classification of respondents. The objective of decision theory in the context of computerized classification testing is to make a best guess of the level of an examinee based on responses, a priori item information, and a priori population classification proportions (Rudner, 2002).
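
As an illustration of the first two termination rules, the following minimal sketch shows how an SPRT decision and an ACI decision could be computed for a single cut point. It is a sketch under stated assumptions, not the procedure used in the thesis: the two-parameter logistic (2PL) model, the half-width `delta` of the indifference region around the cut point, the error rates `alpha` and `beta`, the normal quantile `z`, and all function names are chosen here for illustration only.

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of scored responses (1 = correct, 0 = incorrect) at theta."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def sprt_decision(items, responses, cut, delta=0.3, alpha=0.05, beta=0.05):
    """SPRT: test theta = cut - delta against theta = cut + delta.

    delta, alpha and beta are illustrative settings, not values from the source.
    """
    log_lr = (log_likelihood(cut + delta, items, responses)
              - log_likelihood(cut - delta, items, responses))
    upper = math.log((1.0 - beta) / alpha)   # Wald's upper boundary
    lower = math.log(beta / (1.0 - alpha))   # Wald's lower boundary
    if log_lr >= upper:
        return "master"
    if log_lr <= lower:
        return "non-master"
    return "continue"

def aci_decision(theta_hat, items, cut, z=1.96):
    """ACI: classify when the cut point lies outside theta_hat +/- z * SE.

    The standard error is approximated from the 2PL test information at
    theta_hat; theta_hat itself is assumed to be estimated elsewhere.
    """
    info = 0.0
    for a, b in items:
        p = p_correct(theta_hat, a, b)
        info += a * a * p * (1.0 - p)
    se = 1.0 / math.sqrt(info)
    if cut < theta_hat - z * se:
        return "master"
    if cut > theta_hat + z * se:
        return "non-master"
    return "continue"

# Hypothetical example: 20 identical items (a = 1, b = 0), cut point at 0.
items = [(1.0, 0.0)] * 20
print(sprt_decision(items, [1] * 20, cut=0.0))     # master
print(sprt_decision(items, [1, 0] * 10, cut=0.0))  # continue
print(aci_decision(1.5, items, cut=0.0))           # master
```

Both rules return one of the three options described above: classify as master, classify as non-master, or continue testing. In this sketch the SPRT does not need an ability estimate, whereas the ACI rule depends on one.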

An item selection method selects an item using the current test performance of the examinee (Chang & Ying, 1996). In CAT, item selection methods are often based on maximizing a quantity called information at the current estimate of the respondent’s ability. In CCT, selecting the item with the most information at the current estimate of the examinee’s ability is also used frequently, but other approaches are popular as well, including maximizing information at the cut point. The value of Fisher information at the true ability level of an examinee reflects the efficiency of the item for estimating that ability; put differently, the information is inversely related to the error of the estimated ability (Chang & Ying, 1996). Other methods use a Bayesian approach (Owen, 1975). Exposure control mechanisms and content constraints can also be included in the item selection procedures.
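
The two common selection targets mentioned above, the current ability estimate and the cut point, can be sketched with the same selection function evaluated at a different point on the ability scale. This is a minimal sketch assuming a 2PL model; the item pool, the parameter values, and the function names are hypothetical and serve only as an illustration. Exposure control and content constraints are not included.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(pool, administered, theta):
    """Index of the not yet administered item with maximum information at theta."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))

def standard_error(theta, items):
    """Approximate standard error of the ability estimate: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(sum(item_information(theta, a, b) for a, b in items))

# Hypothetical pool of (discrimination a, difficulty b) pairs.
pool = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
cut_point = 0.0
theta_hat = 0.9

print(select_item(pool, set(), theta_hat))  # 2: most informative at the estimate
print(select_item(pool, set(), cut_point))  # 1: most informative at the cut point
```

The `standard_error` helper shows the inverse relation between information and estimation error: the more information the administered items provide at the ability level, the smaller the standard error.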

The major advantage of adaptive tests is that they provide more efficient ability estimates using fewer items than conventional tests require (Weiss, 1982). Shorter tests reduce testing time and the costs for the testing company, lower the exposure rates of some items, which reduces problems with test security, and reduce the frequency with which the item pool has to be changed (Finkelman, 2008). A shorter test is assembled for respondents who have clearly attained a certain level, and a longer test for those for whom the decision is not as clear-cut. A second major advantage is that items are administered at the level of the respondent, which increases examinees’ motivation and reduces the number of items that are too easy or too difficult for them.

Although several item selection methods and methods for making the classification decision have already been developed, this research project focuses on developing new methods and investigating existing ones, both for item selection and for making classification decisions. The following topics have been selected for the study:
• Test approaches and types of digital assessments
• Multiple objective item selection methods for unidimensional classification testing
• Multidimensional classification methods based on the SPRT for tests with between-item dimensionality
• Multidimensional classification methods based on the SPRT for tests with within-item dimensionality
• Multidimensional classification methods based on the SPRT and ACI for tests with between- and within-item dimensionality

This project was financed by Cito and RCEC (Twente University).