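As an important branch of biomedical information extraction, protein-protein interaction (PPI) extraction is of considerable research significance. Most current work relies on statistical machine learning methods, which require large-scale annotated corpora for training. Too little training data degrades the performance of a relation extraction system, while manual annotation is very costly. This paper adopts transfer learning, using a large amount of annotated source-domain (out-of-domain) data to assist a small amount of annotated target-domain (in-domain) data in PPI extraction. However, the data distributions of different domains differ, which can easily lead to negative transfer; this paper adjusts instance weights according to their relative distribution, thereby avoiding negative transfer. In experiments on the public AIMed corpus, both transfer learning methods clearly outperformed the baseline algorithm; when the same methods were applied to the IEPA corpus, the TrAdaboost algorithm suffered negative transfer, whereas the improved DisTrAdaboost algorithm still transferred well.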
As an important branch of biomedical information extraction, protein-protein interaction (PPI) extraction has great research significance. Current PPI research relies mainly on traditional machine learning, which requires large annotated corpora for training, and labeling new data is costly. This paper applies transfer learning to PPI extraction, using a small amount of labeled target-domain (in-domain) data supported by annotated source-domain (out-of-domain) data. To avoid the negative transfer caused by large differences between the distributions of different domains, we adjust the weight of each source-domain instance according to its relative distribution. Experiments on the AIMed and IEPA corpora demonstrate the effectiveness of our algorithms.
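
For readers unfamiliar with instance-weighted transfer, the following is a minimal sketch of one weight-update round in the style of the standard TrAdaBoost algorithm (Dai et al., 2007), on which the approach builds. It is not the DisTrAdaboost variant proposed here, since the abstract does not specify how the relative distribution enters the update. The function name `tradaboost_reweight` and its parameters are hypothetical, chosen only for illustration, and the clamping of the error rate is an added safeguard rather than part of the published algorithm.

```python
import math


def tradaboost_reweight(weights, is_source, mistakes, n_rounds):
    """One TrAdaBoost-style weight update (illustrative sketch only).

    weights   : current non-negative instance weights
    is_source : True for source-domain instances, False for target-domain
    mistakes  : 1 if the current weak learner misclassified the instance, else 0
    n_rounds  : total number of boosting rounds planned
    Assumes at least one source and one target instance.
    """
    n_source = sum(is_source)
    # Fixed decay factor applied to misclassified source instances.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    # Weighted error of the weak learner, measured on target instances only.
    tgt_total = sum(w for w, s in zip(weights, is_source) if not s)
    eps = sum(
        w * m for w, s, m in zip(weights, is_source, mistakes) if not s
    ) / tgt_total
    # Assumption: clamp the error so beta_tgt stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)
    beta_tgt = eps / (1.0 - eps)

    updated = []
    for w, s, m in zip(weights, is_source, mistakes):
        if s:
            # Misclassified source instances look unlike the target: shrink.
            updated.append(w * beta_src ** m)
        else:
            # Misclassified target instances are hard cases: grow.
            updated.append(w * beta_tgt ** (-m))

    total = sum(updated)
    return [w / total for w in updated]


# Two source and three target instances; one mistake in each domain.
new_weights = tradaboost_reweight(
    weights=[0.2] * 5,
    is_source=[True, True, False, False, False],
    mistakes=[1, 0, 1, 0, 0],
    n_rounds=10,
)
```

In this sketch a misclassified source instance loses weight while a misclassified target instance gains weight, so source data that disagrees with the target domain gradually contributes less. When the two domains differ strongly, this fixed decay may be insufficient, which is the negative-transfer situation the distribution-based weighting is meant to address.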