To address the low backup efficiency and wasted storage caused by large amounts of redundant data in traditional remote backup, a remote backup system based on data deduplication is designed and implemented. The system first divides each backup file into variable-length chunks according to its content using Rabin fingerprints, then sends the metadata of each chunk to the backup centre, where duplicate chunks are detected with an index scheme modeled on Google Bigtable and LevelDB, assisted by a Bloom filter; only the unique chunks are transmitted and stored. Experimental results show that the system effectively removes the duplicate data when backing up similar data sets, and that for incremental backups with small incremental changes it generates less network traffic than Rsync.
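The content-defined chunking step can be sketched as follows. This is a minimal illustration only, not the paper's implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, boundary mask, and chunk-size limits are assumed values the abstract does not give.

import hashlib

WINDOW = 48            # sliding-window size in bytes (assumed)
PRIME = 257            # base of the polynomial rolling hash (assumed)
MOD = 1 << 61          # keeps the hash in a fixed range
MASK = (1 << 13) - 1   # average chunk size of about 8 KiB (assumed)
MIN_CHUNK = 2 * 1024   # forced lower bound on chunk size (assumed)
MAX_CHUNK = 64 * 1024  # forced upper bound on chunk size (assumed)

def chunk_file(data: bytes) -> list:
    """Cut data into variable-length chunks wherever the rolling hash
    over the last WINDOW bytes matches the boundary mask."""
    chunks, start, h = [], 0, 0
    pow_w = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # slide the window: remove the contribution of the oldest byte
            h = (h - data[i - WINDOW] * pow_w) % MOD
        h = (h * PRIME + b) % MOD
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])  # boundary found: emit a chunk
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks

Because boundaries depend on content rather than on fixed offsets, an insertion into a file shifts only the chunks around the edit, so most chunks of a similar file keep their fingerprints.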
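The duplicate check at the backup centre can be sketched in the same spirit, continuing the previous fragment (it reuses chunk_file). A small Bloom filter screens each chunk fingerprint before the index lookup, so most fingerprints never stored before are rejected without touching the index; a plain dictionary stands in for the Bigtable/LevelDB-style index, and the class, function, and fingerprint choices here are illustrative assumptions.

import hashlib

class BloomFilter:
    """Bit array with k hash probes; 'no' is definite, 'yes' may be false."""
    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            d = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(d[:8], "big") % self.size

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))

def backup(data: bytes, index: dict, bloom: BloomFilter) -> list:
    """Return only the chunks the backup centre has not stored yet."""
    new_chunks = []
    for chunk in chunk_file(data):
        fp = hashlib.sha1(chunk).digest()   # chunk fingerprint (SHA-1 assumed)
        # A negative Bloom answer means 'definitely new', so the index
        # is consulted only for possible duplicates.
        if bloom.might_contain(fp) and fp in index:
            continue                        # duplicate: neither sent nor stored
        bloom.add(fp)
        index[fp] = len(new_chunks)         # stand-in for the chunk's location
        new_chunks.append(chunk)
    return new_chunks

Running backup() over successive versions of a data set transmits only chunks whose fingerprints are new, which is the behaviour the experiments compare with Rsync.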