Unlike the vast majority of existing protein-protein interaction (PPI) identification methods, which rely on a single sentence as evidence, this paper proposes a PPI identification method based on a large-scale corpus that incorporates two factors: syntax, through part-of-speech (POS) weighting, and word similarity. First, a feature-based method is used to classify protein-pair signatures. Next, a word segmentation tool is used to POS-tag the protein-pair signatures; feature words are grouped by POS, and each POS is assigned its own weight. Finally, word similarity is computed from the large-scale corpus, and the word similarity matrix is adjusted according to the difference between each word's frequency in the positive class and in the negative class. Experimental results show that, once POS weighting and word similarity are introduced, the final classification accuracy improves markedly over the baseline model.
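The two factors described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the POS weight values, the function names (`pos_weighted_features`, `class_skew`, `adjust_similarity`), and the specific adjustment rule (scaling each similarity by the mean relative frequency gap of the two words) are all assumptions chosen for illustration, since the abstract does not give the actual formulas.

```python
from collections import Counter

# Hypothetical POS weights; the paper's actual values are not stated here.
POS_WEIGHTS = {"VERB": 2.0, "NOUN": 1.0, "ADP": 0.5}


def pos_weighted_features(tagged_tokens, pos_weights=POS_WEIGHTS, default=0.0):
    """Build a weighted bag of words from (word, pos) pairs of one signature.

    Each occurrence of a word contributes the weight of its POS tag;
    tags without a weight contribute `default` (0.0 drops the word).
    """
    features = Counter()
    for word, pos in tagged_tokens:
        weight = pos_weights.get(pos, default)
        if weight:
            features[word.lower()] += weight
    return dict(features)


def class_skew(word, pos_freq, neg_freq):
    """Relative gap between a word's positive- and negative-class frequency.

    Returns a value in [0, 1]; 0 means the word is equally frequent in both
    classes (or unseen), 1 means it occurs in only one class.
    """
    p = pos_freq.get(word, 0.0)
    n = neg_freq.get(word, 0.0)
    total = p + n
    return abs(p - n) / total if total else 0.0


def adjust_similarity(sim, pos_freq, neg_freq):
    """Scale each pairwise similarity by the mean class skew of its two words.

    `sim` maps (word_a, word_b) to a corpus-derived similarity score.
    This scaling rule is an assumed stand-in for the paper's adjustment.
    """
    adjusted = {}
    for (a, b), value in sim.items():
        skew = (class_skew(a, pos_freq, neg_freq)
                + class_skew(b, pos_freq, neg_freq)) / 2
        adjusted[(a, b)] = value * skew
    return adjusted
```

Under these assumptions, a word pair keeps more of its similarity when both words are strongly skewed toward one class, so similarity between words that discriminate poorly between interacting and non-interacting pairs is damped.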