The rapid growth of bioinformatics data calls for more research on data compression, both to relieve storage pressure on servers and to improve the efficiency of network transfer and data analysis. Although many compression programs are available, a detailed comparative analysis of which programs and methods suit very large bioinformatics data sets is still lacking. Here we take typical nucleic acid and protein sequence databases from GenBank, together with two widely used bioinformatics programs, Blast and EMBOSS, as sample data sets, and compare the compression efficiency of different compression programs run with their recommended parameters. The results show that the classical program compress performs surprisingly well overall: its compression ratio is acceptable, and its compression time is significantly shorter than that of the other programs, much shorter even than that of the parallel versions of bzip2 (pbzip2) and gzip (pigz), so it can be considered first. 7-Zip achieves the highest compression ratio but is the most time-consuming, which suggests it is best suited to long-term data storage. Files compressed with bzip2, rar, and gzip have lower compression ratios than those produced by 7-Zip, but compression is comparatively fast. In practice, we recommend compress and the parallel programs pbzip2 and pigz, which balance compression ratio and compression time.
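The two quantities compared here, compression ratio and compression time, can be measured with a small harness. The following is only a minimal sketch under stated assumptions, not the procedure used in the study: it uses the Python standard-library modules zlib, bz2, and lzma as stand-ins for the command-line programs, it defines the compression ratio as original size divided by compressed size (the abstract does not state a definition), and it runs on a randomly generated FASTA-like sequence instead of GenBank data. The names `make_fake_fasta` and `benchmark` are hypothetical.

```python
import bz2
import lzma
import random
import time
import zlib

# Stand-in codecs from the Python standard library. This is an assumption of
# the sketch: the study ran command-line programs (compress, 7-Zip, rar,
# bzip2/pbzip2, gzip/pigz), none of which are invoked here.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),  # DEFLATE, the algorithm gzip uses
    "bz2": (bz2.compress, bz2.decompress),     # the algorithm bzip2 uses
    "lzma": (lzma.compress, lzma.decompress),  # LZMA family (xz container)
}


def make_fake_fasta(n_bases, seed=0):
    """Build a synthetic FASTA-like record (hypothetical data, not GenBank)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    lines = [seq[i:i + 60] for i in range(0, n_bases, 60)]
    return (">hypothetical_sequence\n" + "\n".join(lines) + "\n").encode("ascii")


def benchmark(data):
    """Return one result per codec: name, compression ratio, and elapsed seconds."""
    results = []
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Verify the round trip so that a ratio is never reported for corrupt output.
        if decompress(packed) != data:
            raise ValueError(f"{name}: round trip failed")
        results.append({
            "codec": name,
            # Assumed definition: original size / compressed size.
            "ratio": len(data) / len(packed),
            "seconds": elapsed,
        })
    return results


if __name__ == "__main__":
    for row in benchmark(make_fake_fasta(200_000)):
        print(f"{row['codec']:<6} ratio={row['ratio']:.2f} time={row['seconds']:.4f}s")
```

Timings from a single in-memory run like this are noisy and exclude disk I/O, so they indicate only the general trade-off between ratio and speed; a real comparison would time the actual programs on the actual data sets, ideally over repeated runs.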