To address the scarcity of labeled corpora for protein-protein interaction (PPI) extraction, in contrast to the ready availability of unlabeled biomedical free text, this paper presents a PPI extraction algorithm that combines the transductive support vector machine (TSVM) with active learning. The algorithm actively selects the most useful unlabeled samples and adds them to the TSVM training process, so as to improve system performance as far as possible. Experimental results show that combining TSVM with active learning learns well on a mixed training set consisting of a small number of labeled samples and a large number of unlabeled samples. Compared with the traditional support vector machine (SVM) and TSVM algorithms, the proposed algorithm effectively reduces the number of training samples required and improves classification accuracy, achieving an F-score of 64.12% on the AIMed corpus.
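The sample selection step can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so the code below assumes margin-based uncertainty sampling, a common choice in active learning with SVMs: the unlabeled samples closest to the current decision boundary are treated as the most useful. The linear scorer is a hypothetical stand-in for a trained TSVM, and the names `decision_value` and `select_most_uncertain` are illustrative rather than taken from the paper. How the selected samples are labeled and how the TSVM is retrained are outside the scope of this sketch.

```python
from typing import List, Sequence

Vector = Sequence[float]


def decision_value(w: Vector, b: float, x: Vector) -> float:
    """Signed score w.x + b of a linear classifier.

    Assumption: this stands in for the decision function of a trained TSVM.
    """
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_most_uncertain(w: Vector, b: float,
                          unlabeled: List[Vector], k: int) -> List[int]:
    """Return the indices of the k unlabeled samples nearest the boundary.

    Assumption: "most useful" is taken to mean smallest |w.x + b|,
    i.e. the samples the current classifier is least certain about.
    """
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: abs(decision_value(w, b, unlabeled[i])))
    return ranked[:k]


if __name__ == "__main__":
    # Toy feature vectors; real PPI instances would be sentence-level features.
    w, b = [1.0, -1.0], 0.0
    pool = [[2.0, 0.0], [0.5, 0.25], [0.0, 3.0], [1.0, 1.5]]
    # Scores are 2.0, 0.25, -3.0 and -0.5, so samples 1 and 3 are chosen.
    print(select_most_uncertain(w, b, pool, k=2))  # prints [1, 3]
```

In a full loop under these assumptions, the selected samples would be added to the training set, the classifier retrained, and the selection repeated until the labeling budget is exhausted.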