In this paper, we address the problem of improving backup and recovery performance by compressing redundancies in large disk-based backup systems. We analyze several general-purpose compression algorithms and evaluate their scalability and applicability. We investigate the distribution of redundant data across the whole system, and propose a multi-resolution distributed compression algorithm that can detect duplicated data at file-level, block-level, or byte-level granularity to reduce redundancy in the backup environment. To accelerate recovery, we propose a synthetic backup solution that stores data in a recovery-oriented way and can compose the final data on the back-end backup server. Experiments show that this algorithm greatly reduces bandwidth consumption, saves storage cost, and shortens backup and recovery time. We implement these technologies in our product, the H-info backup system, which achieves over 10x compression ratios in both network utilization and data storage during backup.
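To illustrate the idea of multi-resolution deduplication described above, the following is a minimal sketch, not the paper's implementation: it first checks whether a whole file is already stored, then falls back to fixed-size block fingerprints, sending only blocks the server has not seen. The names (`BLOCK_SIZE`, the in-memory `index`, `send_to_server`) are illustrative assumptions; a real system would keep the fingerprint index on the back-end server and could add a finer byte-level (e.g. delta-encoding) pass for partially changed blocks.

```python
import hashlib

BLOCK_SIZE = 4096          # assumed block granularity
index = set()              # stands in for the backup server's fingerprint index


def fingerprint(data: bytes) -> str:
    """Content fingerprint used to detect duplicates."""
    return hashlib.sha1(data).hexdigest()


def send_to_server(offset: int, block: bytes) -> None:
    """Placeholder for transferring a new block to the back-end server."""
    pass


def backup(path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()

    # File-level check: if the whole file is already stored, send nothing.
    file_fp = fingerprint(data)
    if file_fp in index:
        return
    index.add(file_fp)

    # Block-level check: send only blocks whose fingerprints are unknown.
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        block_fp = fingerprint(block)
        if block_fp not in index:
            index.add(block_fp)
            send_to_server(off, block)
```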