Efficient copy detection over large-scale document collections has long attracted researchers' attention. Copy-detection algorithms usually rely on an inverted index, so a good index structure is critical to their performance. Moreover, as collection sizes grow, a single-machine index can no longer meet the needs of copy detection, and a distributed index must be introduced. To keep pace with ever-growing collections, a good distributed index should offer both high efficiency and high scalability. This paper therefore compares two distributed index structures, the Term-Split index and the Doc-Split index, presents Map-Reduce implementations for building both, and describes the copy-detection methods built on them: the Term-Split approach and the Doc-Split approach. Experiments on the WT10G collection show that the Doc-Split approach achieves better efficiency and scalability.
How to efficiently detect near-duplicate documents in a large corpus has been a hot topic in recent years. Near-duplicate detection algorithms usually use an inverted index to improve their efficiency. However, as the corpus size increases, a single-machine implementation of the index structure becomes intractable, so a distributed index structure is required for near-duplicate detection. To handle rapidly increasing data sizes, a distributed index structure should offer both high efficiency and scalability. In this paper, we compare two different distributed index structures, the Term-Split index and the Doc-Split index, and provide Map-Reduce implementations for building them. Based on these two index structures, we propose two approaches to detecting near-duplicate documents under the Map-Reduce paradigm: the Term-Split approach and the Doc-Split approach. Finally, we compare the performance of the two approaches on the WT10G corpus. Experimental results show that the Doc-Split approach is more efficient and has better scalability.
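The core distinction between the two index structures can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal single-process simulation, with hypothetical function and variable names, of the two partitioning schemes the abstract names: a Term-Split index partitions the inverted index by term (each node holds complete posting lists for its terms), while a Doc-Split index partitions by document (each node indexes a disjoint subset of documents and thus holds only partial posting lists for any term).

```python
# Illustrative sketch only: contrasting term-partitioned (Term-Split)
# and document-partitioned (Doc-Split) inverted indexes. All names
# here are hypothetical, not from the paper.
from collections import defaultdict

# Toy corpus: document id -> list of tokens.
docs = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
    "d3": ["apple", "cherry", "cherry"],
}

NUM_NODES = 2


def term_split_index(docs, num_nodes):
    """Term-Split: route each posting by hash of the term, so the
    complete posting list of a term lives on exactly one node."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            shards[hash(term) % num_nodes][term].append((doc_id, pos))
    return shards


def doc_split_index(docs, num_nodes):
    """Doc-Split: route each whole document by hash of its id, so a
    node holds partial posting lists covering only its documents."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        shard = shards[hash(doc_id) % num_nodes]
        for pos, term in enumerate(terms):
            shard[term].append((doc_id, pos))
    return shards
```

In a real Map-Reduce job the routing key would be the shuffle key (term for Term-Split, document id for Doc-Split); this difference is what drives the efficiency and scalability gap the paper measures, since Doc-Split lets each node score its own documents locally instead of shipping long posting lists between nodes.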