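Online calibration has been widely used to calibrate new items in computerized adaptive testing (CAT) because of its many advantages. Method A is the most direct and algorithmically simplest online calibration method for CAT, but it has an obvious theoretical flaw: during calibration it treats ability estimates as true abilities. This study incorporates the error-correction ideas of the full functional maximum likelihood estimator (FFMLE) and the estimator that makes use of the consequences of sufficiency (ECSE) into Method A (the new methods are denoted FFMLE-Method A and ECSE-Method A, respectively), correcting for ability estimation error on theoretical grounds and thereby overcoming the calibration flaw of Method A. The simulation results show that: (1) under most experimental conditions, the two new methods improve overall calibration precision relative to Method A, with the largest improvement on the short test of length 10; (2) when the CAT is short or of medium length (10 or 20 items), the two new methods perform very close to MEM, the best-performing method, and when the test is long (30 items), ECSE-Method A performs best overall and outperforms MEM; (3) the larger the sample size, the higher the calibration precision of every method.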
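To make the flaw of Method A concrete, the following is a minimal sketch, not the authors' implementation. It calibrates one new item under the two-parameter logistic model while treating the supplied abilities as fixed and known, once with true abilities and once with error-contaminated estimates. The function names, the item parameters (a = 1.2, b = 0.5), the additive normal noise with standard deviation 0.5 standing in for CAT estimation error, and the Newton-Raphson fitting routine are all assumptions made for illustration. The sketch does not implement the FFMLE or ECSE corrections.

```python
import math
import random


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def method_a_calibrate(thetas, responses, n_iter=25):
    """Estimate (a, b) for one new item, treating `thetas` as known abilities.

    Hypothetical helper: maximizes the 2PL likelihood by Newton-Raphson in the
    slope/intercept form logit(P) = a * theta + c, where c = -a * b.
    """
    a, c = 1.0, 0.0
    for _ in range(n_iter):
        g_a = g_c = 0.0              # gradient of the log-likelihood
        h_aa = h_ac = h_cc = 0.0     # entries of the information matrix
        for t, u in zip(thetas, responses):
            p = 1.0 / (1.0 + math.exp(-(a * t + c)))
            w = p * (1.0 - p)
            g_a += (u - p) * t
            g_c += u - p
            h_aa += w * t * t
            h_ac += w * t
            h_cc += w
        det = h_aa * h_cc - h_ac * h_ac
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        d_a = (h_cc * g_a - h_ac * g_c) / det
        d_c = (h_aa * g_c - h_ac * g_a) / det
        a += d_a
        c += d_c
        if abs(d_a) + abs(d_c) < 1e-10:
            break
    return a, -c / a


def simulate(n=3000, a=1.2, b=0.5, noise_sd=0.5, seed=0):
    """Calibrate the same item with true abilities and with noisy estimates."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n)]           # true abilities ~ N(0, 1)
    responses = [1 if rng.random() < p_2pl(t, a, b) else 0 for t in thetas]
    # Assumption: estimation error modeled as independent additive normal noise.
    theta_hats = [t + rng.gauss(0.0, noise_sd) for t in thetas]
    fit_true = method_a_calibrate(thetas, responses)            # analogue of Method A (True)
    fit_est = method_a_calibrate(theta_hats, responses)         # analogue of Method A (Original)
    return fit_true, fit_est


(a_true, b_true), (a_est, b_est) = simulate()
print("true abilities:      a=%.3f b=%.3f" % (a_true, b_true))
print("estimated abilities: a=%.3f b=%.3f" % (a_est, b_est))
```

In this simplified setting the discrimination estimate obtained from noisy abilities tends to be pulled toward zero relative to the one obtained from true abilities. This is the kind of error that measurement-error corrections aim to remove. A real CAT produces estimation error that depends on test length and item selection, so the additive-noise model here is only a rough stand-in.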
The development of computerized adaptive testing (CAT) has raised many new issues and challenges. For example, as a test is continuously administered, new items must be written, calibrated, and added to the item bank periodically to replace flawed, obsolete, and overexposed items. The new items have to be calibrated precisely because calibration precision directly affects the accuracy of ability estimation. Online calibration has been widely used to calibrate new items on-the-fly in CAT, since it offers several advantages over the traditional offline calibration approach. As the simplest and most straightforward online calibration method, Method A (Stocking, 1988) has an obvious theoretical limitation: it treats the estimated abilities as true values and ignores the measurement error in ability estimation. To overcome this weakness, we combined a full functional maximum likelihood estimator (FFMLE) and an estimator that makes use of the consequences of sufficiency (ECSE) (Stefanski & Carroll, 1985), respectively, with Method A to correct for the estimation error of ability; the new methods are referred to as FFMLE-Method A and ECSE-Method A. A simulation study was conducted to compare the two new methods with three other methods: the original Method A [denoted as Method A (Original)], the original Method A with the true abilities of examinees plugged in [Method A (True)], and the “multiple EM cycles” method (MEM). These five methods were evaluated in terms of item-parameter recovery and calibration efficiency under three sample sizes (1000, 2000, and 3000) and three CAT test lengths (10, 20, and 30), assuming the new items are randomly assigned to examinees. Under the two-parameter logistic model, the true abilities for the three groups of examinees were randomly drawn from the standard normal distribution [N(0, 1)]. For all conditions, 1000 operational items were simulated to constitut