弱监督关系抽取利用已有关系实体对从文本集中自动获取训练数据,有效解决了训练数据不足的问题。针对弱监督训练数据存在噪声、特征不足和不平衡,导致关系抽取性能不高的问题,文中提出NF-Tri-training(Tritraining with Noise Filtering)弱监督关系抽取算法。它利用欠采样解决样本不平衡问题,基于Tri-training从未标注数据中迭代学习新的样本,提高分类器的泛化能力,采用数据编辑技术识别并移除初始训练数据和每次迭代产生的错标样本。在互动百科采集数据集上实验结果表明NF-Tri-training算法能够有效提升关系分类器的性能。
Weakly supervised relation extraction utilizes entity pairs to obtain training data from texts automatically,which can effectively deal with the problem of inadequate training data.However,there are many problems in the weakly supervised training data such as noise,inadequate features,and imbalance samples,leading to low performance of relation extraction.In this paper,a weakly supervised relation extraction algorithm named NF-Tri-training(Tri-training with Noise Filtering)is proposed.NF-Tri-training employs an under-sampling approach to solve the problem of imbalance samples,learns new samples iteratively from unlabeled data and uses a data editing technique to identify and discard possible mislabeled samples both in initial training data and in new samples generating at each iteration.The experiment on dataset of Hudong encyclopedia indicates the proposed method can improve the performance of relation classifiers.