Objective To propose a novel bioinformatics algorithm that accurately identifies the protein kinases responsible for known phosphorylation sites, thereby addressing the lack of kinase information for such sites. Methods Following the hierarchical clustering of human kinases, we first extracted kinase-specific phosphorylation instances from Phospho.ELM 9.0, the latest version of the database, and built a dataset for each kinase or kinase cluster. Based on Bayesian decision theory, we then analyzed the amino acid distribution at each residue position around the phosphorylation sites in the positive and negative datasets separately, and constructed the corresponding statistical models. The models were evaluated with a leave-one-out strategy on each dataset. Results For the MAPK, PKA, and RSK kinase families, sensitivity reached 23%, 24%, and 33%, respectively, at a high-confidence level where the false positive rate did not exceed 1%. The algorithm also clearly outperformed phosphorylation site prediction methods such as KinasePhos and NetPhosK. Conclusions The proposed algorithm based on Bayesian decision theory effectively improves kinase identification for known phosphorylation sites and contributes to a better understanding of the biological mechanisms of protein phosphorylation.
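The abstract does not give the exact form of the statistical model, so the following is only a minimal sketch of one common way to apply Bayesian decision theory to this kind of data: a position-specific log-likelihood ratio between the positive and negative residue distributions, scored with leave-one-out, and summarized as sensitivity at a capped false positive rate. The window width, the pseudocount, the independence assumption across positions, the function names, and the toy peptides are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides, pseudocount=1.0):
    """Per-position amino acid frequencies with additive smoothing.

    All peptides must be equal-length windows centered on the phosphosite.
    The pseudocount value is an assumption, not a parameter from the paper.
    """
    width = len(peptides[0])
    freqs = []
    for pos in range(width):
        # Ignore non-standard symbols such as padding characters.
        counts = Counter(p[pos] for p in peptides if p[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def log_likelihood_ratio(peptide, pos_freqs, neg_freqs):
    """Sum over positions of log P(aa | positive) / P(aa | negative).

    Positions are treated as independent (a simplifying assumption).
    """
    return sum(
        math.log(pos_freqs[i][aa] / neg_freqs[i][aa])
        for i, aa in enumerate(peptide)
        if aa in pos_freqs[i]
    )


def leave_one_out_scores(positives, negatives):
    """Score every peptide with a model that never saw that peptide.

    Requires at least two peptides in each class.
    """
    full_pos = position_frequencies(positives)
    full_neg = position_frequencies(negatives)
    pos_scores = []
    for i, held_out in enumerate(positives):
        loo_pos = position_frequencies(positives[:i] + positives[i + 1:])
        pos_scores.append(log_likelihood_ratio(held_out, loo_pos, full_neg))
    neg_scores = []
    for j, held_out in enumerate(negatives):
        loo_neg = position_frequencies(negatives[:j] + negatives[j + 1:])
        neg_scores.append(log_likelihood_ratio(held_out, full_pos, loo_neg))
    return pos_scores, neg_scores


def sensitivity_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """Fraction of positives above the lowest threshold keeping FPR <= max_fpr."""
    allowed = int(max_fpr * len(neg_scores))  # negatives allowed above threshold
    ranked = sorted(neg_scores, reverse=True)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    # A site is called positive only when its score is strictly above threshold.
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Made-up 7-residue windows (phosphosite at the center), for illustration only.
TOY_POSITIVES = ["RRASVAG", "RRLSLPE", "KRRSIDG", "RRGSFKL"]
TOY_NEGATIVES = ["GAVSPLE", "DELSGTA", "PLGSEDV", "AEDSLGP"]

toy_pos_scores, toy_neg_scores = leave_one_out_scores(TOY_POSITIVES, TOY_NEGATIVES)
toy_sensitivity = sensitivity_at_fpr(toy_pos_scores, toy_neg_scores, max_fpr=0.01)
```

With real data, the positive set would hold the sites annotated to one kinase or kinase cluster and the negative set the remaining sites. Reporting sensitivity at a fixed, low false positive rate, as the abstract does, reflects a preference for few false kinase assignments over high coverage.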