东篱科研大数据发现系统（DRDS）

位置：成果数据库 > 期刊 > 期刊详情页

Top-k相似连接算法性能优化

ISSN号：1000-9825
期刊名称：《软件学报》
时间：0
分类：TP311[自动化与计算机技术—计算机软件与理论;自动化与计算机技术—计算机科学与技术]
作者机构：东华大学计算机科学与技术学院,上海201620
相关基金：国家自然科学基金（61370205）; 上海市自然科学基金（13ZR1400800）; 中央高校基本科研业务费专项资金

关键词： Top-k相似连接, 事件驱动框架, Token批处理, 哈希查找优化, top-k similarity join, event-driven framework, token batch processing, hash lookupoptimization

中文摘要：

相似连接算法在数据清理、数据集成和重复网页检测等领域有着广泛的应用.现有相似连接算法有两种类型：基于相似度阈值的相似连接和Top-k相似连接.Top-k连接算法非常适合于相似度阈值未知的应用场景,目前最为有效的Top-k相似连接算法是Xiao等人提出的Topk-join.为了解决Topk-join中存在的性能问题,提出了一种Top-k相似连接算法Opt-join,该算法将Token批处理技术集成在现有的事件驱动框架中,以降低前缀事件的处理代价;通过置换哈希查找与过滤操作的执行位置来降低哈希查找代价,并理论证明了该置换的正确性.实验结果表明：与Topk-join算法相比,Opt-join取得了1.28倍~3.09倍的性能提升.实验数据还显示：随着数据长度的增加或k值的增长,Opt-join的性能优势有不断增加的趋势.

英文摘要：

Similarity join is widely used in data cleaning, data integration and the detection of near duplicate Web pages. Existing similarity join algorithms fall into two categories： Threshold-based similarity join and Top-k similarity join. Top-k similarity join is suitable for applications in which the threshold is unknown in advance. The most efficient Top-k similarity join algorithm is Top-k-join, which is proposed by Xiao et al. In order to resolve the performance problemsof Topk-join, a novel Top-k similarity join algorithm Opt-join is proposed in this paper. By integrating the token batch processing technique into the existing event-driven framework, Opt-join reduces the cost of processing the prefix events. In addition, Opt-joinreduces the cost in hash lookup by switching the positions of the hash lookup and filtering operations. The correctness of the new algorithm is proved. Experimental results show that 1.28x-3.09xspeed-up is achieved by Opt-join compared with Topk-join. More importantly, with the increase of the record length or the k value, Opt-join surpasses Topk-join by a larger margin.

同期刊论文项目

数据广播环境下实时调度理论及应用研究

期刊论文 3

同项目期刊论文

Time-ordered collaborative filtering for news recommendation

多核平台下Esper数据流管理系统性能分析研究

期刊信息

《软件学报》
北大核心期刊（2011版）

主管单位:中国科学院
主办单位:中国科学院软件研究所中国计算机学会
主编：赵琛
地址：北京8718信箱中国科学院软件研究所
邮编：100190
邮箱：jos@iscas.ac.cn
电话：010-62562563

国际标准刊号：ISSN：1000-9825
国内统一刊号：ISSN：11-2560/TP
邮发代号:82-367

获奖情况:
2001年入选中国期刊方阵“双百期刊”,2000年荣获中国科学院优秀科技期刊一等奖

国内外数据库收录:
俄罗斯文摘杂志,美国数学评论（网络版）,波兰哥白尼索引,德国数学文摘,荷兰文摘与引文数据库,美国工程索引,美国剑桥科学文摘,英国科学文摘数据库,日本日本科学技术振兴机构数据库,中国中国科技核心期刊,中国北大核心期刊（2004版）,中国北大核心期刊（2008版）,中国北大核心期刊（2011版）,中国北大核心期刊（2014版）,中国北大核心期刊（2000版）

被引量:54609