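The Global Skyline query is a variant of the Skyline query. It is closely related to the dynamic Skyline query and the reverse Skyline query, and it has been widely applied in multi-objective decision making, network monitoring, and data mining. As data accumulates, traditional centralized Skyline queries can no longer meet the requirements of big data processing. To handle large-scale time-series data efficiently, we propose a parallel Global Skyline Cell query algorithm based on the MapReduce framework. First, based on an analysis of practical application requirements, we propose a time inverted index model for Skyline queries over time-series data. We also introduce the concept of the Global Skyline cell and use the dominance relationship between cells for efficient coarse-grained pruning, which avoids most of the useless computation. Second, the query point partitions the data space into quadrants; by iterating over the quadrants in turn, the algorithm retrieves the Global Skyline cells and then extracts the Global Skyline points from these candidates, laying the groundwork for subsequent dynamic Skyline and reverse Skyline queries. Finally, we implement the algorithm on a Hadoop cluster. Experimental results show that the algorithm effectively resolves the tension between time and space costs in Skyline queries over large-scale time-series data and can meet practical application requirements.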
The Global Skyline query is a variant of the Skyline query that has been used in multi-objective decision making, business planning, network monitoring, and data mining. Its result set is closely related to those of the dynamic Skyline query and the reverse Skyline query. As the volume of historical data grows, Skyline queries on a centralized system can no longer cope with big data, and Skyline queries over large-scale time-series data remain a challenge. We propose a parallel Global Skyline algorithm for time-series data. First, we present an inverted index built on time-series data. Second, we introduce the concept of the Global Skyline cell, which allows dominated cells to be eliminated according to the cell dominance relationship. This coarse-grained pruning strategy avoids a large amount of unnecessary computation. The query point divides the data space into four quadrants, and the Global Skyline query is executed in each quadrant in turn. Finally, through extensive experiments on both real-world and synthetic datasets, we show that our algorithm is efficient for big data on time series.
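To make the quadrant-based evaluation concrete, the following is a minimal, single-machine sketch of point-level global dominance. It is not the parallel cell-based algorithm summarized above. It assumes the commonly used definition in which a point globally dominates another with respect to a query point q when both lie in the same quadrant around q and the first is at least as close to q in every dimension and strictly closer in at least one. The function names and the convention of assigning boundary points to the upper side of each axis are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def quadrant(p: Sequence[float], q: Sequence[float]) -> Tuple[bool, ...]:
    """Quadrant of p relative to q: one flag per dimension.

    Assumption: points lying exactly on a boundary (p_i == q_i) are
    assigned to the upper side of that axis.
    """
    return tuple(pi >= qi for pi, qi in zip(p, q))


def globally_dominates(a: Sequence[float], b: Sequence[float],
                       q: Sequence[float]) -> bool:
    """True if a globally dominates b with respect to q."""
    if quadrant(a, q) != quadrant(b, q):
        return False  # points in different quadrants never dominate each other
    da = [abs(ai - qi) for ai, qi in zip(a, q)]
    db = [abs(bi - qi) for bi, qi in zip(b, q)]
    no_worse = all(x <= y for x, y in zip(da, db))
    strictly_better = any(x < y for x, y in zip(da, db))
    return no_worse and strictly_better


def global_skyline(points: List[Point], q: Point) -> List[Point]:
    """Naive global skyline: process each quadrant around q in turn."""
    groups: Dict[Tuple[bool, ...], List[Point]] = {}
    for p in points:
        groups.setdefault(quadrant(p, q), []).append(p)

    result: List[Point] = []
    for group in groups.values():
        for p in group:
            # Keep p only if no other point in its quadrant dominates it.
            if not any(globally_dominates(o, p, q) for o in group if o is not p):
                result.append(p)
    return result


if __name__ == "__main__":
    pts = [(1, 1), (2, 2), (3, 1), (-1, -1), (-2, -3), (-1, 2)]
    print(global_skyline(pts, (0, 0)))
```

The quadratic comparison inside each quadrant is only for illustration; the coarse-grained cell pruning described in the abstract is intended to discard most candidates before any point-level comparison of this kind takes place.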