Metagenomic binning, the classification of metagenomic sequencing reads, is a central problem in metagenomics. Binning performance depends chiefly on feature extraction: the choice of feature vectors strongly affects both classification accuracy and running time. Based on the characteristics of metagenomic data and the properties of the third-order Markov model, this paper proposes a feature extraction method built on the transition probability matrix. A feature selection algorithm based on mutual information is then used to reduce the dimensionality of the extracted feature vectors. Finally, the resulting feature vectors are applied to a support vector machine (SVM) classifier, and the performance is compared with that of related binning algorithms. The results show that the proposed feature vectors discriminate well between different metagenomic species and are particularly suitable for large-scale metagenomic binning.
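As an illustration of the kind of feature described above, the following is a minimal sketch of how a transition probability matrix of a third-order Markov model could be flattened into a feature vector. The abstract does not give the exact construction, so everything here is an assumption: the function name `transition_features`, the restriction to the alphabet A/C/G/T, the 64 x 4 layout (three-nucleotide context by next nucleotide, 256 features in total), and the optional `pseudocount` parameter are all hypothetical choices made for this example.

```python
from itertools import product

ALPHABET = "ACGT"
# All 64 three-nucleotide contexts, in lexicographic order (assumed layout).
CONTEXTS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def transition_features(seq, pseudocount=0.0):
    """Return a 256-element list of estimated P(next base | previous 3 bases).

    Hypothetical helper: this is one plausible reading of a
    "transition probability matrix" feature, not the paper's exact method.
    """
    seq = seq.upper()
    counts = {ctx: dict.fromkeys(ALPHABET, 0) for ctx in CONTEXTS}

    # Count every observed transition: 3-base context -> following base.
    for i in range(len(seq) - 3):
        ctx, nxt = seq[i:i + 3], seq[i + 3]
        # Skip windows containing ambiguous symbols such as N.
        if ctx in counts and nxt in ALPHABET:
            counts[ctx][nxt] += 1

    # Normalize each row of the 64 x 4 matrix and flatten it row by row.
    features = []
    for ctx in CONTEXTS:
        row = counts[ctx]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        for base in ALPHABET:
            if total:
                features.append((row[base] + pseudocount) / total)
            else:
                # Context never observed and no smoothing: leave the row at zero.
                features.append(0.0)
    return features


if __name__ == "__main__":
    vec = transition_features("ACGTACGTACGA")
    print(len(vec))
```

Vectors of this form could then be passed to a mutual-information-based selector and an SVM classifier, for example scikit-learn's `mutual_info_classif` and `SVC`; that pairing is likewise an assumption and not necessarily the implementation used in the paper.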