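To make effective use of protein tandem mass spectrometry data and improve the accuracy of protein identification, a KNN-based algorithm is proposed for scoring the match between a protein sequence and a protein tandem mass spectrum. Scoring this match is a key technique in protein identification by database search. However, existing algorithms do not make good use of the intensity information of the ions in a tandem mass spectrum. To address this problem, this paper gives a reasonable partition of all ions according to their types in the spectrum, abstracts a high-dimensional intensity feature vector from that partition, builds an intensity-matching knowledge set on known high-accuracy data sets, and finally constructs a sequence-spectrum match scoring algorithm based on the KNN technique. Experimental results show that the proposed algorithm makes more effective use of the structural information in protein tandem mass spectra and improves the accuracy of protein identification.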
A scoring approach for protein identification is proposed that uses the k-nearest-neighbor (KNN) technique to evaluate the match between a protein sequence and a protein tandem mass spectrum within a database search framework. Scoring the match between sequence and spectrum is the key technique in database search approaches to protein identification. However, the available approaches do not make the best use of the intensity information of the ions in the spectrum. To address this problem, we propose a method that uses this intensity information to improve the accuracy of protein identification. A high-dimensional vector is extracted from the total intensity of each kind of ion in the spectrum, and a KNN-based scoring method is built on it. Experimental results show that the proposed approach effectively improves the accuracy of protein identification.
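The pipeline summarized above (group ions by type, sum intensity per group into a feature vector, score a candidate by its nearest neighbors in a knowledge set) might be sketched as follows. This is an illustrative sketch only, not the paper's implementation. The choice of singly charged b- and y-ions as the only ion types, the 0.5 Da tolerance, the Euclidean distance, the voting rule, and all function names (`fragment_mz`, `intensity_features`, `knn_score`) are assumptions made for the example. The paper's vector is high-dimensional, whereas this one has two components.

```python
import math

# Monoisotopic residue masses (Da) for a few amino acids; extend as needed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON, WATER = 1.00728, 18.01056


def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values, grouped by ion type (assumed partition)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, n)]
    return {"b": b, "y": y}


def intensity_features(peptide, spectrum, tol=0.5):
    """Fraction of total spectrum intensity matched by each ion type.

    `spectrum` is a list of (m/z, intensity) peaks; `tol` is a hypothetical
    mass tolerance in Da.
    """
    total = sum(intensity for _, intensity in spectrum) or 1.0
    ions = fragment_mz(peptide)
    features = []
    for ion_type in ("b", "y"):
        matched = sum(
            intensity
            for mz, intensity in spectrum
            if any(abs(mz - theo) <= tol for theo in ions[ion_type])
        )
        features.append(matched / total)
    return features


def knn_score(features, knowledge, k=3):
    """Fraction of the k nearest knowledge-set entries labelled 1 (correct match).

    `knowledge` is a list of (feature_vector, label) pairs, assumed to come
    from a data set with trusted identifications.
    """
    nearest = sorted(knowledge, key=lambda item: math.dist(features, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy usage: one candidate peptide, one spectrum, a tiny made-up knowledge set.
spectrum = [(129.07, 30.0), (147.11, 50.0), (300.0, 20.0)]
knowledge = [
    ([0.35, 0.50], 1), ([0.30, 0.45], 1), ([0.40, 0.40], 1),
    ([0.05, 0.10], 0), ([0.00, 0.05], 0),
]
candidate_features = intensity_features("GAK", spectrum)
candidate_score = knn_score(candidate_features, knowledge, k=3)
```

In a database search, every candidate sequence for a spectrum would be scored this way and the highest-scoring one reported. How the paper actually combines neighbor votes into a score is not stated in the abstract.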