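A new method for protein secondary structure prediction is proposed. The method first runs a homologous sequence search with the PSI-BLAST program on the amino acid sequences in the dataset to obtain the corresponding PSSM, then encodes the matrix with a sliding window to produce the classifier input. Classifier ensembles are used: all samples are divided into nine mutually exclusive training sets, each of which trains a single sub-classifier. The nine individual 0-1 sub-classifiers are then combined by majority voting to form a 0-1 classifier that recognizes one specific type of protein secondary structure. Three such 0-1 classifier models, integrated serially, can recognize the three protein secondary structure classes (H/E/C). Tests on the standard datasets RS126, CB396, and CB513 show that, for the same classifier, using the PSSM as the classifier input gives higher prediction accuracy than using the protein sequence directly as input.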
A new method for predicting protein secondary structure is introduced. Amino acid sequences from the datasets are first preprocessed with PSI-BLAST, a homologous sequence search program, and the resulting PSSM (position-specific scoring matrix) is encoded with a sliding window to generate the classifier input. To improve the generalization ability of the classifier, all samples are divided into nine mutually exclusive training sets, each used to train an individual classifier. Because the training sets differ from one another, the individual classifiers learn different decision functions, and integrating them improves classification accuracy. The nine separate 0-1 classifiers are integrated by majority voting to identify one specific protein secondary structure. In this way, three 0-1 classifier ensembles integrated serially can identify the three kinds of protein secondary structure (H/E/C). Each individual classifier is a feedforward neural network optimized by the PSO (particle swarm optimization) algorithm. Tests on the standard datasets (RS126, CB396, and CB513) show that, for the same classifier, using the PSSM as input yields higher prediction accuracy than using the protein sequence directly as input.
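The pipeline described above (sliding-window encoding of the PSSM, majority voting over nine 0-1 sub-classifiers, and serial integration of three ensembles) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the window width of 13, zero padding at the sequence ends, the H, E, C query order, and the fallback label are all assumptions made for illustration. The sub-classifiers are plain callables standing in for the PSO-optimized feedforward networks.

```python
from typing import Callable, List, Sequence, Tuple

Row = List[float]              # one PSSM row (20 scores per residue in a real PSSM)
Voter = Callable[[Row], int]   # stand-in for one trained 0-1 sub-classifier


def window_encode(pssm: Sequence[Row], width: int = 13) -> List[Row]:
    """Build one input vector per residue by flattening the `width` PSSM rows
    centred on it. Positions beyond the sequence ends are zero-padded
    (assumed convention; the abstract does not specify one)."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a centre residue")
    half = width // 2
    pad = [0.0] * len(pssm[0])
    padded = [pad] * half + [list(row) for row in pssm] + [pad] * half
    return [
        [value for row in padded[i:i + width] for value in row]
        for i in range(len(pssm))
    ]


def majority_vote(voters: Sequence[Voter], x: Row) -> int:
    """Return 1 when more than half of the 0-1 sub-classifiers output 1."""
    ones = sum(voter(x) for voter in voters)
    return 1 if 2 * ones > len(voters) else 0


def serial_predict(
    ensembles: Sequence[Tuple[str, Sequence[Voter]]],
    x: Row,
    default: str = "C",
) -> str:
    """Query the per-class ensembles in order and return the label of the
    first one that votes 1. The fallback label is an assumption."""
    for label, voters in ensembles:
        if majority_vote(voters, x) == 1:
            return label
    return default
```

In use, each row returned by `window_encode` would be passed to `serial_predict` together with a list such as `[("H", h_voters), ("E", e_voters), ("C", c_voters)]`, where each voter list holds the nine sub-classifiers trained on the nine mutually exclusive training sets. With an odd number of voters, majority voting cannot tie.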