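Using a new feature extraction method, amino acid composition distribution (AACD), with support vector machines as the member classifiers and a "one-versus-one" multi-class strategy, four classes of protein homo-oligomers were classified from the protein primary sequence. Under the 10-fold cross-validation (10-CV) test, the overall accuracy and accuracy index obtained with AACD reached 86.22% and 67.12%, which are 5.74 and 10.03 percentage points higher than those of the conventional amino acid composition feature extraction method, and 3.12 and 5.63 percentage points higher than those of the dipeptide composition feature extraction method. This indicates that AACD is a very effective feature extraction method for classifying protein homo-oligomers. Combining AACD with the protein sequence length feature gave an overall accuracy of 86.35% and an accuracy index of 67.23%, indicating that the sequence length feature carries some spatial structure information.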
Because the gap between the rapidly growing number of known sequences and the slowly accumulating number of known structures keeps widening, automatic classification based on primary sequences and known three-dimensional structures is becoming increasingly important. A fully automatic and reliable classification system is also needed because primary sequences contain much information that is useful to biologists. In general, the performance of a classification system can be improved by choosing an appropriate feature extraction algorithm. A novel sequence-based feature extraction method, amino acid composition distribution (AACD), has therefore been developed to classify protein homo-oligomers; it generalizes the 20 components of the conventional amino acid composition. The primary sequence is divided into several equal segments, and each element of the AACD array is calculated as the number of occurrences of one of the 20 natural amino acids within a segment divided by the length of the corresponding sequence. The classification system uses support vector machines as the classifier, adopts the "one-versus-one" strategy for multi-class categorization, and applies AACD to the classification of four classes of homo-oligomers from the protein primary sequence. The results of the 10-fold cross-validation (10-CV) test show that the overall accuracy and accuracy index of AACD are 86.22% and 67.12%, which are 5.74 and 10.03 percentage points higher than those of amino acid composition, and 3.12 and 5.63 percentage points higher than those of the dipeptide composition (amino acid pairs) feature extraction method, respectively. Combining AACD with the length of the protein primary sequence slightly improves performance, giving an overall accuracy of 86.35% and an accuracy index of 67.23%.
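The AACD calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "the length of the corresponding sequence" means the length of the full sequence, so that summing over segments recovers the conventional amino acid composition, and that segment boundaries are set by integer division when the length is not an exact multiple of the number of segments. The names `aacd`, `AMINO_ACIDS`, and `n_segments` are hypothetical.

```python
# The 20 natural amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aacd(sequence, n_segments=2):
    """Return an AACD-style feature vector of length 20 * n_segments.

    Assumption: each count is divided by the full sequence length, so
    that n_segments=1 gives the conventional amino acid composition.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")

    features = []
    for k in range(n_segments):
        # Assumption: near-equal segments with integer-division boundaries.
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        segment = seq[start:end]
        for aa in AMINO_ACIDS:
            # Residues outside the 20 standard letters are simply not counted.
            features.append(segment.count(aa) / n)
    return features


if __name__ == "__main__":
    vec = aacd("AACD", n_segments=2)
    print(len(vec), vec[0], vec[21], vec[22])
```

In this sketch the sequence `AACD` with two segments is split into `AA` and `CD`, so the vector has 40 elements. Such vectors could then be passed to any support vector machine implementation that supports one-versus-one multi-class classification.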
Using two-dimensional principal component analysis (2DPCA) to reduce the dimension of the combined feature vectors gives better results, with an overall accuracy of 87.12% and an accuracy index of 68.08%. The res