To address the shortage of manually annotated samples in biological literature databases, this paper proposes a semi-supervised method based on co-training. After the samples are preprocessed, a machine learning method based on word (bag-of-words) features and a method based on pattern learning each select a different subset of the sample features, and the two are combined within a co-training framework. During training, each method learns from a small set of initial labeled samples and a large set of unlabeled samples, and uses the learning results of the other method to enlarge its labeled sample set. On the AIMED corpus the method achieves an F1 value of 63.9%. Comparative experiments show that it outperforms supervised methods and makes effective use of unlabeled samples, which makes it suitable for real extraction tasks.
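The co-training loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TinyNaiveBayes` classifier, the `co_train` function, the `rounds` and `per_round` parameters, and the toy samples with their `words` and `patterns` fields are all hypothetical stand-ins, chosen only to show how two learners trained on different feature views might label unlabeled samples for each other.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes over lists of string features (illustrative stand-in)."""

    def fit(self, feature_lists, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for feats, y in zip(feature_lists, labels):
            self.counts[y].update(feats)
        self.vocab = {f for c in self.classes for f in self.counts[c]}
        return self

    def predict_proba(self, feats):
        total = sum(self.prior.values())
        log_scores = {}
        for c in self.classes:
            # Add-one smoothing; the extra +1 reserves mass for unseen features.
            denom = sum(self.counts[c].values()) + len(self.vocab) + 1
            score = math.log(self.prior[c] / total)
            for f in feats:
                score += math.log((self.counts[c][f] + 1) / denom)
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}

    def predict(self, feats):
        proba = self.predict_proba(feats)
        return max(proba, key=proba.get)


def fit_view(train, view):
    """Train one learner on a single feature view of the (sample, label) pairs."""
    return TinyNaiveBayes().fit([view(s) for s, _ in train], [y for _, y in train])


def co_train(labeled, unlabeled, view_a, view_b, rounds=5, per_round=2):
    """Each learner labels its most confident pool samples for the OTHER learner."""
    train_a, train_b, pool = list(labeled), list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        clf_a, clf_b = fit_view(train_a, view_a), fit_view(train_b, view_b)
        for clf, view, target in ((clf_a, view_a, train_b), (clf_b, view_b, train_a)):
            scored = []
            for sample in pool:
                proba = clf.predict_proba(view(sample))
                label = max(proba, key=proba.get)
                scored.append((proba[label], label, sample))
            scored.sort(key=lambda item: item[0], reverse=True)
            for _, label, sample in scored[:per_round]:
                target.append((sample, label))  # enlarge the other learner's set
                pool.remove(sample)
    # Refit once more so the returned learners see every added sample.
    return fit_view(train_a, view_a), fit_view(train_b, view_b)


# Hypothetical toy data: label 1 = interaction, label 0 = no interaction.
labeled = [
    ({"words": ["binds", "directly"], "patterns": ["P1_binds_P2"]}, 1),
    ({"words": ["expressed", "cells"], "patterns": ["P1_and_P2"]}, 0),
]
unlabeled = [
    {"words": ["binds", "strongly"], "patterns": ["P1_interacts_P2"]},
    {"words": ["expressed", "tissue"], "patterns": ["P1_and_P2"]},
    {"words": ["associates"], "patterns": ["P1_binds_P2"]},
]

clf_a, clf_b = co_train(
    labeled, unlabeled,
    view_a=lambda s: s["words"],      # word-feature view
    view_b=lambda s: s["patterns"],   # pattern view
    per_round=2,
)
print(clf_b.predict(["P1_interacts_P2"]), clf_a.predict(["associates"]))
```

In this toy run the pattern `P1_interacts_P2` and the word `associates` never appear in the labeled seed set; each learner can only classify them because the other learner, working from its own view, supplied a label for the unlabeled sample that contains them. A real system would replace the stand-in classifier with the paper's word-feature and pattern-learning methods and would typically apply a confidence threshold before accepting a label.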