Efficient copy detection over large-scale document collections has long attracted researchers' attention. Copy-detection algorithms usually rely on an inverted index, so a good index structure is critical to their performance. Moreover, as collection sizes grow, a single-machine index can no longer meet the needs of copy detection, and a distributed index must be introduced. To keep pace with ever-growing collections, a good distributed index should offer both high efficiency and high scalability. This paper therefore compares two distributed index structures, the Term-Split index and the Doc-Split index, presents Map-Reduce implementations for building both, and describes the copy-detection methods built on them: the Term-Split approach and the Doc-Split approach. Experiments on the WT10G collection show that the Doc-Split approach achieves better efficiency and scalability.
How to efficiently detect near-duplicate documents in a large corpus has been a hot topic in recent years. Near-duplicate detection algorithms usually use an inverted index to improve their efficiency. However, as the corpus size increases, a single-machine implementation of the index structure becomes intractable, so a distributed index structure is required for near-duplicate detection. To handle rapidly increasing data sizes, a distributed index structure should offer both high efficiency and scalability. In this paper, we compare two different distributed index structures, the Term-Split index and the Doc-Split index, and provide Map-Reduce implementations for building them. Based on these two index structures, we propose two approaches to detecting near-duplicate documents under the Map-Reduce paradigm: the Term-Split approach and the Doc-Split approach. Finally, we compare the performance of the two approaches on the WT10G corpus. Experimental results show that the Doc-Split approach is more efficient and has better scalability.
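The core distinction between the two index structures can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal single-process simulation, with hypothetical function and variable names, of the two partitioning schemes the abstract names: a Term-Split index partitions the inverted index by term (each node holds complete posting lists for its terms), while a Doc-Split index partitions by document (each node indexes a disjoint subset of documents and thus holds only partial posting lists for any term).

```python
# Illustrative sketch only: contrasting term-partitioned (Term-Split)
# and document-partitioned (Doc-Split) inverted indexes. All names
# here are hypothetical, not from the paper.
from collections import defaultdict

# Toy corpus: document id -> list of tokens.
docs = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
    "d3": ["apple", "cherry", "cherry"],
}

NUM_NODES = 2


def term_split_index(docs, num_nodes):
    """Term-Split: route each posting by hash of the term, so the
    complete posting list of a term lives on exactly one node."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            shards[hash(term) % num_nodes][term].append((doc_id, pos))
    return shards


def doc_split_index(docs, num_nodes):
    """Doc-Split: route each whole document by hash of its id, so a
    node holds partial posting lists covering only its documents."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        shard = shards[hash(doc_id) % num_nodes]
        for pos, term in enumerate(terms):
            shard[term].append((doc_id, pos))
    return shards
```

In a real Map-Reduce job the routing key would be the shuffle key (term for Term-Split, document id for Doc-Split); this difference is what drives the efficiency and scalability gap the paper measures, since Doc-Split lets each node score its own documents locally instead of shipping long posting lists between nodes.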