In a model that uses words as features, a feature word whose usage differs markedly across categories has a strong influence on classification. This paper therefore uses a large-scale corpus to study how different feature weighting methods affect protein-protein interaction (PPI) identification. First, a signature for each protein pair is built by searching a biomedical literature database, and a vector space model (VSM) is constructed with words as the features that describe the relationship between the two proteins. Then, different weighting methods are selected to describe the importance of each word. Finally, K-nearest-neighbor (KNN) and support vector machine (SVM) classifiers are built to decide whether a protein pair interacts. The experimental results show that weighting the feature vectors by word importance markedly improves the precision, recall, and accuracy of PPI identification.
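The pipeline summarized above (word-based signatures, feature weighting, then classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration. The abstract does not name its weighting schemes, so TF-IDF stands in as one common choice. The toy signatures and labels are invented. The function names (`idf_table`, `weighted_vector`, `cosine`, `knn_predict`) are hypothetical. Only the K-nearest-neighbor step is shown, with cosine similarity as an assumed distance measure; the SVM classifier is omitted.

```python
import math
from collections import Counter

# Hypothetical toy "signatures": words collected from text mentioning a protein pair.
# Labels: 1 = interacting pair, 0 = non-interacting pair. All data is invented.
TRAIN = [
    ("binds phosphorylates complex binds", 1),
    ("interacts complex activates", 1),
    ("expressed tissue cell", 0),
    ("gene expressed cell located", 0),
]

def idf_table(docs):
    """Inverse document frequency per word (TF-IDF is an assumed weighting scheme)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}

def weighted_vector(doc, idf):
    """Sparse term-frequency vector scaled by IDF; words unseen in training are dropped."""
    tf = Counter(doc.split())
    return {w: c * idf[w] for w, c in tf.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(doc, train, idf, k=3):
    """Majority vote over the k training signatures most similar to doc."""
    query = weighted_vector(doc, idf)
    scored = sorted(
        ((cosine(query, weighted_vector(text, idf)), label) for text, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Weights are learned from the training signatures only.
IDF = idf_table([text for text, _ in TRAIN])
```

Calling `knn_predict("binds complex", TRAIN, IDF)` classifies a new signature by its weighted similarity to the labeled pairs. Replacing `idf_table` with a different weighting function is where alternative schemes would be compared.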