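The subcellular localization of a protein is closely related to its function, so predicting localization helps in understanding protein function. This paper proposes a sequence-segmented pseudo amino acid composition (PseAAC) feature extraction method and uses a support vector machine to classify two protein subcellular localization datasets constructed by Chou (C2129 and CS2423). Prediction performance is evaluated with the overall accuracy Q3, the content-balance accuracy index Q9, and other measures. The results show that the sequence-segmented PseAAC method outperforms PseAAC features extracted from the full protein sequence. For example, with sequence-segmented moment-descriptor PseAAC, Q3 and Q9 on dataset C2129 are 84.7% and 60.8%, which are 1.8 and 2.2 percentage points higher, respectively, than with full-sequence moment-descriptor PseAAC, and Q3 is 9.1 percentage points higher than the existing method of Xiao et al. The feature vectors built by the sequence-segmented PseAAC method contain not only positional information between residues but also coupling information between protein sub-sequences. In addition, the segmented sub-sequences may be related to the functional domains of the protein, which allows the method to predict protein subcellular localization effectively.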
Knowing the subcellular localization of a protein is important because it provides insight into the protein's function, as well as into how, and in what kind of cellular environment, proteins interact with each other and with other molecules. A novel feature extraction method, sequence-segmented pseudo amino acid composition (PseAAC), has been developed to predict protein subcellular localization on two datasets (C2129 and CS2423) first constructed by Chou and Shen. The authors used a support vector machine as the classifier and evaluated the prediction system with the overall accuracy Q3, the content-balance accuracy index Q9, and other measures. The results show that the sequence-segmented PseAAC method performs better than PseAAC with features extracted from the full sequence. For example, the Q3 and Q9 of sequence-segmented moment-descriptor PseAAC on dataset C2129 are 84.7% and 60.8%, respectively, which are 1.8 and 2.2 percentage points higher than those of full-sequence moment-descriptor PseAAC, and the Q3 is also 9.1 percentage points higher than that of Xiao's method. The feature vectors extracted with the sequence-segmented PseAAC method contain not only the order information between residues but also the coupling information among the sub-sequences, and the sub-sequences may be correlated with the protein functional domains. Sequence-segmented PseAAC is therefore an effective method for predicting protein subcellular localization.
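The segmentation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes equal-length contiguous segments and uses plain amino acid composition per segment as a simplified stand-in for the PseAAC and moment-descriptor features, whose exact definitions are not given in this abstract. The function names and the default of three segments are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines feature positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_into_segments(sequence, n_segments):
    """Split a sequence into contiguous parts of near-equal length.

    Assumption: equal-length segmentation. The abstract does not say how
    the authors choose segment boundaries.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    base, extra = divmod(len(sequence), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional residue each.
        end = start + base + (1 if i < extra else 0)
        segments.append(sequence[start:end])
        start = end
    return segments


def composition(segment):
    """Return the 20-dimensional amino acid frequency vector of a segment."""
    counts = Counter(segment)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def segmented_composition(sequence, n_segments=3):
    """Concatenate per-segment compositions into one feature vector.

    Simplified stand-in for sequence-segmented PseAAC: only composition is
    computed here, with no sequence-order or moment-descriptor terms.
    """
    features = []
    for segment in split_into_segments(sequence.upper(), n_segments):
        features.extend(composition(segment))
    return features


if __name__ == "__main__":
    vector = segmented_composition("MKTAYIAKQRQISFVKSHFSRQ", n_segments=3)
    print(len(vector))
```

Because each segment contributes its own block of features, two proteins with the same overall composition but different residue distributions along the chain receive different vectors, which is the kind of positional information the abstract attributes to the segmented approach. Such vectors could then be passed to any support vector machine implementation.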