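Comparative Analysis of the Compression Efficiency of Different Compression Programs on Massive Bioinformatics Data
  • Journal: Chinese Journal of Bioinformatics (生物信息学). 2009, 7(3):196-201 (China Journal Net database)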
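  • Classification: TP31 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliation: [1] Institute of Radiation Medicine, Academy of Military Medical Sciences; State Key Laboratory of Proteomics, Beijing 100850
  • Funding: National Key Basic Research Development Program of China (973 Program) (2006CB504100, 2003CB715900); National Natural Science Foundation of China (30771230, 30772293).
  • Related project: Differential proteomics study of the blood-brain barrier in hypoxic-ischemic brain injury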
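Chinese abstract (translated):

The continuing flood of massive bioinformatics data makes further research on data compression urgent, in order to relieve storage pressure on servers and to improve the efficiency of network transfer and data analysis. Although a great many data compression programs have been developed, a detailed, comprehensive comparison of which programs and methods should be used for massive bioinformatics data is still lacking. Taking typical nucleic acid and protein sequence databases from GenBank and the typical bioinformatics software packages Blast and EMBOSS as examples, this paper carries out a comprehensive comparison of different data compression programs. The results show that the classical compression program compress has high overall compression efficiency: besides an acceptable compression ratio, its compression time is markedly shorter than that of the other programs, and much shorter even than that of the parallelized bzip2 (pbzip2) and gzip (pigz) programs, so it can be given priority. 7-Zip has the highest compression ratio, but its compression is very time-consuming, so it is suited to long-term data storage; files compressed with bzip2, rar and gzip have lower compression ratios than those compressed with 7-Zip, but compression is comparatively fast. In practice, the classical program compress and the parallel programs pbzip2 and pigz are recommended, as all three are good choices that balance compression ratio and compression time.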

English abstract:

Ever-growing volumes of bioinformatics data create an urgent need for high-performance compression techniques, both to reduce the storage demands placed on servers and to improve the efficiency of network transfer and data analysis. Although a large number of compression programs are available, it is not clear which of them is best suited to compressing massive bioinformatics data. Here we chose several typical nucleic acid and protein sequence data sets from the GenBank databases, together with two widely used bioinformatics programs, Blast and EMBOSS, as sample data sets, and compared the compression efficiency of different compression programs run with their recommended parameters. The results showed that the classical program compress performs surprisingly well and can be considered first: its compression ratio is acceptable, and it is faster even than the parallel versions of bzip2 (pbzip2) and gzip (pigz). 7-Zip achieved the highest compression ratio but took the longest time, indicating that it should be used mainly for data storage. The programs bzip2, rar and gzip required significantly less time than 7-Zip while still giving acceptable compression ratios. In summary, we suggest using compress and the parallel programs pbzip2 and pigz to balance compression ratio against compression time.
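The kind of comparison the abstract describes, compression ratio against compression time, can be illustrated with a minimal sketch. This is not the paper's benchmark. It assumes Python's standard-library gzip, bz2 and lzma modules as stand-ins for gzip, bzip2 and the LZMA family used by 7-Zip; compress, rar, pbzip2 and pigz have no standard-library equivalent and are not covered. The input is synthetic random DNA rather than GenBank data, and the ratio is taken here as original size divided by compressed size, a definition the abstract does not state. The helper names `make_sequence` and `benchmark` are hypothetical.

```python
import bz2
import gzip
import lzma
import random
import time


def make_sequence(n_bases, seed=0):
    """Return n_bases of pseudo-random DNA as bytes (synthetic stand-in data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")


def benchmark(data):
    """Compress data with each codec; return {name: (ratio, seconds)}.

    ratio is original size / compressed size (an assumed definition).
    """
    codecs = {
        "gzip": gzip.compress,   # DEFLATE, as used by gzip and pigz
        "bzip2": bz2.compress,   # as used by bzip2 and pbzip2
        "lzma": lzma.compress,   # LZMA family, as used by 7-Zip
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(packed), elapsed)
    return results


if __name__ == "__main__":
    data = make_sequence(200_000)
    for name, (ratio, seconds) in benchmark(data).items():
        print(f"{name:6s} ratio={ratio:5.2f} time={seconds * 1000:8.1f} ms")
```

Timings from a sketch like this depend heavily on the machine, the input and the compression level, so they say nothing about the rankings reported in the paper; real sequence files would have to be substituted for the synthetic input to make any comparison meaningful.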
