Active learning methods can alleviate the burden of labeling large numbers of instances by selecting only the most informative examples and submitting them to experts for labeling. Sampling is a key factor influencing the performance of active learning. Current mainstream sampling methods generally choose the instance or instances that bisect the version space. However, the strategy of halving the version space assumes that every hypothesis in the version space is equally likely to be the target function, an assumption that cannot hold in real-world problems. This paper analyzes the limitation of the halving strategy and then presents a heuristic sampling method named MPWPS (the most possibly wrong-predicted sampling), which aims to reduce the version space by more than half. At each sampling step, MPWPS chooses the instance or instances that the current classifier is most likely to predict incorrectly, so that more than half of the hypotheses in the version space are eliminated. As a result, to reach the same classification accuracy, MPWPS requires fewer sampling rounds than the mainstream active learning methods that bisect the version space. Experiments show that, on most datasets, MPWPS samples fewer instances than traditional sampling methods when reaching the same target accuracy.
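To make the selection step concrete, the following is a minimal sketch of an MPWPS-style pool-based active learning loop. It assumes a scikit-learn probabilistic classifier and approximates "most possibly wrong-predicted" by the instance on which the classifier is least confident in its own prediction; this criterion, the function names mpwps_select and active_learning_loop, and the use of LogisticRegression are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mpwps_select(clf, X_pool, n_samples=1):
    """Return indices of the pool instances the current classifier is most
    likely to predict wrongly, approximated here by the lowest confidence
    in its own predicted label (an assumption for illustration)."""
    proba = clf.predict_proba(X_pool)        # class probabilities for each pool instance
    confidence = proba.max(axis=1)           # confidence in the predicted label
    return np.argsort(confidence)[:n_samples]  # least confident = most possibly wrong

def active_learning_loop(X_lab, y_lab, X_pool, y_pool, budget=20):
    """Query `budget` instances one at a time, retraining after each query.
    Expert labeling is simulated by revealing y_pool for the queried index."""
    clf = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        clf.fit(X_lab, y_lab)
        idx = mpwps_select(clf, X_pool)                  # query the "most possibly wrong" instance
        X_lab = np.vstack([X_lab, X_pool[idx]])          # add the newly labeled instance
        y_lab = np.concatenate([y_lab, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)          # remove it from the unlabeled pool
        y_pool = np.delete(y_pool, idx)
    return clf
```

Under this reading, the claimed advantage over halving strategies is that each query targets a region where the current hypothesis is likely wrong, so a correction there can eliminate more than half of the consistent hypotheses per round.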