Top-k query is a widely used operation. Taking existing top-k algorithms as the basis for analysis, this paper proposes solutions to their shortcomings. It first proposes SRTA (Sequential-Read Threshold Algorithm). Compared with the NRA (No Random Access) algorithm, SRTA reorganizes how the data is stored: it creates a new table that shifts overhead from memory to cheaper external storage, so that efficient top-k queries require only sequential reads. The table is also partitioned, which further improves efficiency under parallel processing and allows the algorithm to run well in memory-constrained environments. DSRTA (Distributed Sequential-Read Threshold Algorithm), built on SRTA, targets distributed environments. DSRTA first divides the original data set into multiple subspaces by ID and then applies the data reorganization; by exploiting the performance advantages of a distributed system, it further improves the query efficiency of SRTA.
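
The ID-based division can be illustrated with a minimal sketch. It is not SRTA or DSRTA itself, whose table layout and threshold test are not specified in this abstract. It assumes rows of the form (object_id, attribute, score), a sum as the aggregation function, and a modulo rule for assigning IDs to partitions; the function names are hypothetical. The point it shows is that when all rows of an object fall into the same partition, each partition can be scanned sequentially on its own and the local top-k lists can be merged into the global top-k.

```python
import heapq
from collections import defaultdict


def partition_by_id(rows, num_partitions):
    """Send every row of an object to the same partition (assumed rule: id modulo n)."""
    partitions = [[] for _ in range(num_partitions)]
    for object_id, attribute, score in rows:
        partitions[object_id % num_partitions].append((object_id, attribute, score))
    return partitions


def rank_key(item):
    """Order by total score, breaking ties in favour of the smaller object ID."""
    object_id, total = item
    return (total, -object_id)


def local_top_k(partition, k):
    """One sequential pass over a partition: sum scores per object, keep the k best."""
    totals = defaultdict(float)
    for object_id, _attribute, score in partition:
        totals[object_id] += score
    return heapq.nlargest(k, totals.items(), key=rank_key)


def distributed_top_k(rows, k, num_partitions):
    """Merge the local top-k lists; each partition could run on a separate worker."""
    candidates = []
    for partition in partition_by_id(rows, num_partitions):
        candidates.extend(local_top_k(partition, k))
    return heapq.nlargest(k, candidates, key=rank_key)
```

Because an object's total score is computed entirely inside one partition, any object in the global top-k must also be in its partition's local top-k, so merging k candidates per partition is sufficient under these assumptions.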