Existing machine learning-based protein-protein interaction (PPI) identification systems make predictions solely from evidence within a single sentence, and they suffer from small training sets because annotated data are scarce. To address these issues, this paper proposes a PPI identification approach based on a hybrid similarity model. A basic Relational Similarity (RS) model makes the initial predictions. Word similarity matrices are computed from a large-scale text corpus, and a clustering algorithm groups words according to their similarity. The resulting word clusters are introduced into the basic RS model as clustered features, yielding a hybrid model. Experimental results show that the basic RS model achieves high and well-balanced precision and recall, and that introducing the word similarity model further improves the F-score. Because the approach directly uses known PPI information, it avoids additional manual annotation.
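The word-similarity step can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the clustering algorithm, so everything below is an assumption made for illustration: cosine similarity over windowed co-occurrence counts, a greedy single-pass clustering with a fixed threshold, and the hypothetical helper names `context_vectors`, `cosine`, `cluster_words`, and `to_cluster_features`. It shows only how corpus-derived word clusters might replace individual word features, not the RS model itself.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.5):
    """Greedy single-pass clustering (assumed algorithm).

    Each word joins the first cluster whose seed word is at least
    `threshold` similar to it; otherwise it seeds a new cluster.
    """
    seeds = []
    assignment = {}
    for word in sorted(vectors):
        for cluster_id, seed in enumerate(seeds):
            if cosine(vectors[word], vectors[seed]) >= threshold:
                assignment[word] = cluster_id
                break
        else:
            assignment[word] = len(seeds)
            seeds.append(word)
    return assignment


def to_cluster_features(tokens, assignment):
    """Replace each known word with its cluster label; keep unseen words as-is."""
    return [f"C{assignment[t]}" if t in assignment else t for t in tokens]
```

With features expressed as cluster labels, two sentences that use different but similar interaction words share features, which is one plausible way clustering could offset a small training set.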