To resolve the locality-dependence and multi-node-dependence problems of existing throughput-improvement methods for deduplication systems, this paper proposes a deduplication model based on file-similarity clustering. The model extends the traditional flat index structure into a spatial structure and, following Broder's theorem, keeps only a small number of the most representative indices resident in memory; it also partitions the index horizontally and distributes the shards across several fully autonomous storage nodes. Experimental results show that the model effectively improves deduplication performance and average throughput in a large-scale cloud-storage environment, and that the data load is balanced across the nodes, so the model scales well.
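The core idea (cluster files by similarity, pick a representative index per Broder's theorem, and route to an autonomous node) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the chunk size, SHA-1 fingerprinting, and the helper names `chunk_fingerprints`, `representative_fingerprint`, and `route_to_node` are all assumptions made for the example.

```python
import hashlib


def chunk_fingerprints(data: bytes, chunk_size: int = 4096) -> set:
    """Split a file into fixed-size chunks and fingerprint each chunk
    with SHA-1, yielding the file's set of chunk fingerprints."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def representative_fingerprint(fingerprints: set) -> str:
    """Select the minimum fingerprint as the file's representative.
    By Broder's theorem, two sets share the same minimum element under a
    random permutation with probability equal to their Jaccard similarity,
    so similar files tend to share the same representative."""
    return min(fingerprints)


def route_to_node(fingerprints: set, num_nodes: int) -> int:
    """Map a file to a storage node by hashing its representative
    fingerprint, so similar files cluster on the same autonomous node
    and only the representatives need to stay in memory."""
    rep = representative_fingerprint(fingerprints)
    return int(rep, 16) % num_nodes
```

Because routing depends only on the representative fingerprint, each node can deduplicate independently, which is what removes the multi-node dependence the abstract mentions.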