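Twenty-five CPPs and 16 non-CPPs were selected as the training set, and 61 CPPs and 21 non-CPPs as the prediction set. The peptide chains were encoded with the amino acid z-scales, and both the original 72 auto cross covariance variables and their principal component vectors were used for classification by linear discriminant analysis (LDA) and support vector machine (SVM). With LDA, good results were obtained for the training set and for its leave-one-out cross-validation, but the best overall recognition rate on the prediction set was only 57.3%. The nonlinear recognition models built with the principal components and with the original variable set as SVM input gave overall recognition rates of 85.4% and 100%, respectively, on the training set, and 80.5% and 75.6%, respectively, in leave-one-out cross-validation; the best overall recognition rate on the prediction set was 74.4%. The results indicate that SVM captures subtle pattern changes among the original variables fairly well and that its overall recognition of CPPs is better than that of LDA.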
To identify new potential CPPs, two methods, Fisher's linear discriminant analysis (LDA) and the support vector machine (SVM), were used to construct classifiers. We collected 123 known natural peptides (CPPs and non-CPPs) from the literature and divided them into two data sets: a training set with 25 CPPs and 16 non-CPPs, and a test set with 61 CPPs and 21 non-CPPs. Each amino acid was described by its principal properties (z-scales), and the resulting auto cross covariances (ACCs) and their principal components were each used to construct classifiers. The models obtained with Fisher's LDA classified only 57.3% of the test set correctly, although they showed high classification rates on the training set in both training and cross-validation. With SVM, the classification rates on the training set were 100% in training and 75.6% in leave-one-out cross-validation when the 72 ACCs were used, and 85.4% and 80.5%, respectively, when their principal components were used. The best SVM result on the test set was 74.4%, obtained with the 72 ACCs. These results indicate that the SVM can capture subtle pattern changes among the variables and that the SVM model performs better than the LDA model.
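The encoding step can be illustrated with a minimal sketch of how auto cross covariance terms might be computed from per-residue scale values. The abstract does not say how the 72 variables arise; the sketch assumes three z-scales per residue, all nine ordered scale pairs, and lags 1 to 8 (3 × 3 × 8 = 72), which matches the count but is not confirmed by the source. The normalization by the number of residue pairs is one common form of the ACC and is likewise an assumption. The function name `acc_descriptors` and the toy values are hypothetical; the toy values are not real z-scales.

```python
from typing import List, Sequence


def acc_descriptors(z: Sequence[Sequence[float]], max_lag: int) -> List[float]:
    """Return auto cross covariance terms for one peptide.

    z[i][j] is the j-th scale value of residue i. For every ordered pair
    of scales (j, k) and every lag from 1 to max_lag, the term is the mean
    of z[i][j] * z[i + lag][k] over all positions i where both residues exist.
    """
    n = len(z)
    if n <= max_lag:
        raise ValueError("peptide must be longer than max_lag")
    n_scales = len(z[0])
    terms = []
    for j in range(n_scales):
        for k in range(n_scales):
            for lag in range(1, max_lag + 1):
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                terms.append(total / (n - lag))
    return terms


# Toy input: 10 residues with 3 made-up scale values each (NOT real z-scales).
toy_peptide = [[((i * 7 + j * 3) % 5) - 2.0 for j in range(3)] for i in range(10)]
descriptors = acc_descriptors(toy_peptide, max_lag=8)
print(len(descriptors))  # 72 = 3 scales x 3 scales x 8 lags
```

Under these assumptions every peptide, whatever its length, maps to a vector of the same size, which is what allows peptides of different lengths to be fed to LDA or SVM classifiers.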