在许多应用领域中,top-k查询是一种十分重要的操作,它根据给定的评分函数在潜在的巨大的数据空间中返回k个最重要的对象.不同于传统的TA算法,NRA算法只需要顺序读就可以处理top-k查询,从而适合于随机读受限或不可能的场合.文中详细地分析了NRA算法的执行行为,确定了增长阶段和收缩阶段中每个文件需要扫描的元组个数.文中发现在海量数据环境中,NRA在增长阶段需要维护大量的候选元组,严重影响了算法的执行效率.所以,文中提出一种新的海量数据上的top-k查询算法TKEP,该算法在查询的增长阶段就执行早剪切,从而大大减少增长阶段需要维护的候选元组.文中给出了早剪切操作的数学分析,确定了早剪切操作的理论和实际剪切效果.据作者所知,该文是第一篇提出在top-k查询的增长阶段执行早剪切的文章.实验结果表明,和传统的NRA相比,TKEP在增长阶段维护的元组数量减少3个数量级,需要的内存量减少1个数量级,TKEP算法获得1个数量级的加速比.
In many application fields,top-k is an important operation since it returns k most important objects according to a given ranking function.Different from traditional TA algorithms,NRA only requires sequential access to return top-k results so that it can be used in environment where random access is limited or impossible.This paper analyzes the execution behavior of NRA and determines tuple number to scan in increasing and shrinking phase.It is found that in massive data context,NRA needs to maintain large quantity of candidate tuples in increasing phase which affects algorithm efficiency significantly.This paper proposes a novel top-k algorithm TKEP(Top-K with Early Pruning) on massive data which performs early pruning in increasing phase to prune most of candidate tuples.This paper provides mathematical analysis of early pruning and proves its theoretical and practical pruning effect.To the best of our knowledge,it is the first paper to provide early pruning in top-k processing.The extensive experiments show that compared to NRA,TKEP maintains less tuples by a factor of three orders of magnitude,it consumes less memory by a factor of an order of magnitude and TKEP achieves substantial performance speed-up of an order of magnitude.