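Based on the synthesis and sorting mechanisms of proteins, a new method for predicting protein subcellular localization is proposed. First, an exhaustive search is used to find, for each subcellular class of protein sequences, the optimal split site between the sorting signal and the mature protein. Each sequence is divided at this site into two subsequences, and the amino acid compositions of the two subsequences are computed and fused to form the feature vector of the whole sequence. An optimal sub-classifier is then constructed to recognize each class of protein, and the sub-classifiers are assembled into an ensemble classifier according to the maximization principle. On the NNPSL dataset, under 5-fold cross-validation, the method achieves overall prediction accuracies of 94.1% and 87.5% on the prokaryotic and eukaryotic protein sequence subsets, respectively. In addition, the split sites found by the method in some protein sequences agree with observed biological phenomena and can provide reference information for predicting cleavage sites in protein sequences.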
Prediction of protein subcellular localization can help infer the function of proteins and provide insight into the interactions between proteins. A novel approach based on the sorting mechanism of proteins is proposed for predicting the subcellular localization of proteins. For each kind of protein, an optimal split site is found by an exhaustive search, dividing the sequence into a sorting-signal subsequence and a mature-protein subsequence. In designing the classifier, a sub-classifier is built to discriminate each kind of protein from the rest, and these sub-classifiers are then combined into an ensemble classifier to predict the subcellular localization of unknown proteins. In fivefold cross-validation tests on the NNPSL datasets, overall accuracies of 94.1% and 87.5% are obtained for prokaryotic and eukaryotic proteins, respectively; on the TargetP datasets, the overall accuracies are 90.2% and 93.9% for plant and non-plant proteins, respectively. Moreover, the optimal split sites found in this paper coincide with biological facts for most kinds of proteins, which can help predict the cleavage sites of proteins.
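The feature construction and the combination step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`composition`, `split_features`, `best_split_site`, `predict_max`), the use of fractional composition over the 20 standard residues, and the user-supplied `evaluate` scoring function are all assumptions made for illustration; the abstract does not specify how candidate split sites are scored or which sub-classifier is used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid composition (fractions) of seq."""
    counts = Counter(seq)
    n = sum(counts[a] for a in AMINO_ACIDS)
    if n == 0:
        # Empty subsequence (or no standard residues): all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / n for a in AMINO_ACIDS]


def split_features(seq, site):
    """Split seq at `site` and fuse the two compositions into 40 values.

    seq[:site] plays the role of the sorting signal and seq[site:] the
    mature protein (assumed convention for this sketch).
    """
    return composition(seq[:site]) + composition(seq[site:])


def best_split_site(candidates, evaluate):
    """Exhaustive search: try every candidate site, keep the best-scoring one.

    `evaluate` is a hypothetical callable mapping a site to a score,
    for example a cross-validated accuracy of one sub-classifier.
    """
    return max(candidates, key=evaluate)


def predict_max(scores):
    """Max-rule ensemble: pick the class whose sub-classifier scores highest."""
    return max(scores, key=scores.get)
```

A possible usage, with made-up inputs: `split_features("MKKLLAAAGG", 5)` yields a 40-element vector whose first 20 entries describe `MKKLL` and whose last 20 describe `AAAGG`, and `predict_max({"cytoplasmic": 0.2, "periplasmic": 0.7})` selects `"periplasmic"`. In a fuller pipeline each class would have its own split site and its own one-vs-rest sub-classifier producing the scores passed to `predict_max`.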