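Objective Advances in high-throughput sequencing have produced large volumes of microbial 16S rRNA gene sequence data. Accurately partitioning these data into operational taxonomic units (OTUs) helps characterize the population composition and distribution of microbes in the environment. Methods Using both real and simulated datasets, we compared seven popular existing OTU clustering algorithms and analyzed their strengths, weaknesses, and scope of application. Results Both sequence length and sequencing depth affected the clustering results. Conclusions At the same sequence similarity threshold, different clustering algorithms produced substantially different results; among them, the CROP algorithm showed comparatively good robustness and noise tolerance.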
Objective Recent advances in high-throughput next-generation sequencing technology have made it possible to generate large numbers of 16S rRNA sequences. Accurately clustering these sequences into operational taxonomic units (OTUs) allows us to explore the population composition and distribution of environmental microbes. Methods We conducted a comprehensive evaluation of seven existing OTU inference methods on both real and simulated data and identified the advantages and limitations of these algorithms. Results Sequence length and sequencing depth both affected the OTU results. Conclusions At the same sequence similarity threshold, the clustering results of the algorithms differed, and the CROP algorithm was robust and insensitive to noise.
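To make the notion of clustering at a sequence similarity threshold concrete, the following is a minimal sketch of a greedy, seed-based grouping of reads into OTUs. It is not CROP or any of the seven evaluated algorithms. The function names `identity` and `greedy_otu_clustering` are hypothetical, and the similarity measure (fraction of matching positions between equal-length, pre-aligned reads) is a simplifying assumption made only for illustration.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Assumption: reads are already aligned and of equal length; real tools
    compute identity from a pairwise or multiple alignment instead.
    """
    if len(a) != len(b):
        raise ValueError("this toy measure assumes equal-length, pre-aligned sequences")
    if not a:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clustering(seqs: List[str], threshold: float) -> List[Tuple[str, List[str]]]:
    """Assign each read to the first OTU whose seed is at least `threshold` similar.

    Reads are processed in input order; a read that matches no existing
    seed becomes the seed of a new OTU.
    """
    otus: List[Tuple[str, List[str]]] = []  # (seed sequence, member reads)
    for s in seqs:
        for seed, members in otus:
            if identity(s, seed) >= threshold:
                members.append(s)
                break
        else:
            # No existing seed was similar enough: start a new OTU.
            otus.append((s, [s]))
    return otus


# Toy reads, invented for illustration only.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC", "ACGTACGTAC"]
loose = greedy_otu_clustering(reads, threshold=0.85)
strict = greedy_otu_clustering(reads, threshold=0.95)
print(len(loose), len(strict))
```

Even in this toy setting the number of OTUs changes with the threshold, and because the procedure is greedy, the outcome also depends on the order in which reads are processed. This is one simple way in which different clustering strategies can disagree at the same nominal threshold.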