As data volumes continue to grow, big data management has become increasingly important. Among the many mathematical models available, the probabilistic model is well suited to managing big data because it can abstract massive data into a small amount of probabilistic data. Studying probabilistic data management in big data environments is therefore of great significance. As a classic query, the range query over probabilistic data has been studied extensively. However, state-of-the-art approaches are not suitable for big data environments, fundamentally because their indexes incur high update costs. In this paper, we propose a novel index named HGD-Tree to solve this problem. First, we propose a group of algorithms that reduce the cost of handling newly arriving objects; they execute insertion, deletion, and update operations efficiently while keeping the tree structure balanced. Second, we propose a partition-based method for constructing summaries of probabilistic objects, which adapts the partition resolution to the characteristics of each object's probability density function. Moreover, because the proposed summary is expressed as bit vectors, these strategies allow the index to manage probabilistic data at a low space cost. Finally, we propose a bitwise-operation-based algorithm for accessing HGD-Tree to support range queries, which performs pruning with a small number of bitwise operations. Theoretical analysis and extensive experimental results demonstrate the effectiveness of the proposed algorithms.
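The idea of pruning range queries with bit-vector summaries can be illustrated with a minimal sketch. It is not the HGD-Tree itself. It assumes a one-dimensional domain [0, 1), a fixed uniform grid rather than the adaptive partition described above, and uncertain objects given as a finite list of sample positions. The names `GRID`, `cell_of`, `summarize`, `range_mask`, and `can_prune` are hypothetical and chosen only for this illustration.

```python
GRID = 16  # assumed number of equal-width cells over the domain [0, 1)


def cell_of(x: float) -> int:
    """Return the index of the grid cell containing x, clamped to the grid."""
    return min(GRID - 1, max(0, int(x * GRID)))


def summarize(samples) -> int:
    """Build a bit vector: bit i is set if any sample falls into cell i."""
    mask = 0
    for x in samples:
        mask |= 1 << cell_of(x)
    return mask


def range_mask(lo: float, hi: float) -> int:
    """Build a bit vector with one bit set per cell overlapping [lo, hi]."""
    first, last = cell_of(lo), cell_of(hi)
    return ((1 << (last - first + 1)) - 1) << first


def can_prune(obj_mask: int, query_mask: int) -> bool:
    """An object is pruned when none of its occupied cells overlap the query."""
    return (obj_mask & query_mask) == 0


# Hypothetical object whose samples lie near 0.1 and 0.2.
obj = summarize([0.10, 0.12, 0.20])
far_query = range_mask(0.50, 0.75)   # does not touch the object's cells
near_query = range_mask(0.15, 0.30)  # overlaps the cell containing 0.20
```

In this sketch a single AND decides whether an object can be discarded without examining its samples. Objects that survive the test still need an exact probability computation, since sharing a cell with the query does not guarantee that any probability mass lies inside the range.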