

PAA: An Efficient Approximate Aggregation Algorithm on Massive Data

ISSN: 1000-1239
Journal: Journal of Computer Research and Development (《计算机研究与发展》)
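Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
Funding: National Basic Research Program of China (973 Program) (2012CB316200); National Natural Science Foundation of China (61190115, 61173022, 61033015, 60831160525, 61272046, 60903016); Harbin Institute of Technology Scientific Research Innovation Fund (HIT.NSRIF.2014136)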

Keywords: massive data, PAA algorithm, approximate aggregation, partition, random sample

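Chinese abstract (translated):

Aggregation is a commonly used but time-consuming database operation. Compared with an exact query, returning an approximate result that satisfies a confidence interval in far less response time is often the better choice. Existing approximate query methods cannot efficiently process approximate aggregation queries of arbitrary precision on massive data. This paper proposes a new algorithm, PAA (partition-based approximate aggregation), to efficiently process approximate aggregation satisfying an arbitrary confidence interval. The data space of the dimensional attributes is divided into spatial regions of equal size, and each partition maintains the tuples whose dimensional attributes fall into the corresponding region. PAA maintains a random sample RS of the table, and its execution has two stages. In stage 1, PAA tries to answer the query from the pre-built random sample RS. If RS cannot return an approximate result that meets the user's requirement, then in stage 2 PAA retrieves more random tuples from IPS, the set of partitions whose regions intersect the query region. The distinguishing features of PAA are: 1) how to obtain the required random tuples from IPS without knowing how many tuples in each partition of IPS satisfy the predicate; 2) how to effectively reduce the random I/O cost in stage 2. Experiments show that, compared with existing methods, PAA achieves a speedup of two orders of magnitude.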
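
The partitioning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cell_of`, `build_partitions`, `intersecting_partitions`), the dictionary-based tuple layout, and the axis-aligned query box are all assumptions made for illustration. It shows only how equal-size grid cells map to partitions and how the partition set IPS for a query region could be enumerated.

```python
from collections import defaultdict
from itertools import product


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


def build_partitions(tuples, dims, cell_size):
    """Group tuples by the equal-size grid cell their dimensional attributes fall into."""
    partitions = defaultdict(list)
    for t in tuples:
        partitions[cell_of([t[d] for d in dims], cell_size)].append(t)
    return partitions


def intersecting_partitions(partitions, low, high, cell_size):
    """Return the keys of non-empty partitions whose cell overlaps the box [low, high].

    This plays the role of the set IPS in the abstract (hypothetical helper).
    """
    ranges = [
        range(int(lo // cell_size), int(hi // cell_size) + 1)
        for lo, hi in zip(low, high)
    ]
    return [cell for cell in product(*ranges) if cell in partitions]


rows = [
    {"x": 0.5, "y": 0.5},
    {"x": 1.5, "y": 0.2},
    {"x": 3.7, "y": 3.1},
    {"x": 2.0, "y": 2.0},
]
parts = build_partitions(rows, dims=("x", "y"), cell_size=1.0)
ips = intersecting_partitions(parts, low=(0.0, 0.0), high=(1.9, 0.9), cell_size=1.0)
```

Only partitions in `ips` can contain tuples that satisfy the range predicate, so any stage-2 sampling can ignore every other partition.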

English abstract:

Aggregation is a commonly used but time-consuming operation in database systems. Compared with an exact query, it is often more attractive to return an approximate result within the required error bound to the user in a much shorter response time. However, none of the previous methods can process approximate aggregation on massive data with both arbitrary accuracy and high efficiency. A novel algorithm, PAA (partition-based approximate aggregation), is proposed to efficiently process approximate aggregation with an arbitrary confidence interval. The data space of the dimensional attributes is divided into multiple hypercubes of the same size. Each partition maintains the tuples whose dimensional attributes fall into the corresponding hypercube. A random sample RS is pre-constructed on the table. PAA consists of two stages. If the approximate result obtained from RS in stage 1 does not satisfy the confidence interval, then in stage 2 more random tuples must be retrieved from IPS, the set of partitions whose hypercubes overlap the search region. The novelty of PAA lies in how to retrieve random tuples from IPS when the exact number of tuples satisfying the predicate in each partition is unknown, and in how to reduce the random I/O cost of the retrieval operation as much as possible. The experimental results show that PAA obtains up to two orders of magnitude speedup compared with the existing methods.
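
The two-stage control flow can be sketched as follows for an AVG query. This is a simplified illustration under stated assumptions, not the PAA algorithm itself: the abstract does not give the sampling procedure, so stage 2 here simply shuffles an in-memory pool of all IPS tuples, which sidesteps the two problems the paper addresses (unknown per-partition counts and random I/O cost). The names `approx_avg`, `mean_with_half_width`, `batch`, and `max_half_width` are hypothetical, and the interval is a plain normal-approximation interval at an assumed 95% level.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for a 95% confidence level (assumed level)


def mean_with_half_width(values, z=Z_95):
    """Return the sample mean and the normal-approximation confidence half-width.

    With fewer than two values the half-width is infinite.
    """
    if len(values) < 2:
        return (values[0] if values else float("nan")), math.inf
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return statistics.fmean(values), half_width


def approx_avg(rs, ips_tuples, predicate, value, max_half_width, batch=100, rng=None):
    """Estimate AVG(value) over tuples satisfying predicate in two stages."""
    rng = rng or random.Random()

    # Stage 1: answer from the pre-built random sample RS alone.
    values = [value(t) for t in rs if predicate(t)]
    mean, half_width = mean_with_half_width(values)

    # Stage 2 (stand-in): draw further tuples from the IPS pool in random
    # order until the interval is narrow enough or the pool is exhausted.
    pool = list(ips_tuples)
    rng.shuffle(pool)
    pos = 0
    while half_width > max_half_width and pos < len(pool):
        chunk = pool[pos:pos + batch]
        pos += batch
        values.extend(value(t) for t in chunk if predicate(t))
        mean, half_width = mean_with_half_width(values)

    return mean, half_width, len(values)
```

Checking the interval after every batch is itself a simplification, since repeated checks weaken the nominal confidence level. A real system would also read each partition from disk rather than hold IPS in memory.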
