大数据的存储与分析是近年来数据库领域研究的热点,高效的索引技术是提高大数据查询分析性能的重要技术手段。在现有的数据存储模型及索引技术研究基础上,提出使用MapReduce构建列存储数据的索引。该索引技术结合MapReduce编程模型,先在Map阶段完成数据划分,然后在Reduce阶段完成数据的排序,最后在数据有序的Reduce节点上创建RB+树索引,从而减少索引创建时因为RB+树内部节点递归分裂而产生的昂贵代价和树的高度,提高数据查询的性能。通过在真实数据集上进行实验,验证了所提出方法的有效性。
Big data storage and analysis have been a research focus in the database field in recent years, and efficient indexing is an important means of improving the performance of big data query and analysis. Building on existing research on data storage models and index techniques, we propose using MapReduce to build indexes for column-store data. Following the MapReduce programming model, the technique first partitions the data in the Map phase, then sorts the data in the Reduce phase, and finally creates an RB+ tree index over the sorted data on each Reduce node. Because each index is built from data that is already sorted, this reduces both the high cost of recursively splitting the inner nodes of the RB+ tree during index creation and the height of the tree, which in turn improves query performance. Experiments on real log file datasets verify the effectiveness of the proposed method.
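The three steps described above (partition in Map, sort in Reduce, build the tree over sorted data) can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the RB+ tree is not specified here, so a generic B+-tree-style structure built bottom-up stands in for it, and the function names (`map_partition`, `reduce_sort`, `bulk_load`, `lookup`), the range boundaries, and the fanout are all hypothetical choices made for illustration. The point it shows is that, once keys are sorted, leaves can be filled left to right and parent levels stacked on top, so no node ever splits.

```python
from bisect import bisect_right
from collections import defaultdict


def map_partition(values, boundaries):
    """Map phase (simulated): route each column value to a range partition."""
    partitions = defaultdict(list)
    for row_id, value in enumerate(values):
        # bisect_right picks the range [boundaries[i-1], boundaries[i]) holding value
        partitions[bisect_right(boundaries, value)].append((value, row_id))
    return partitions


def reduce_sort(pairs):
    """Reduce phase (simulated): group row ids by value and sort by value."""
    grouped = defaultdict(list)
    for value, row_id in pairs:
        grouped[value].append(row_id)
    return sorted(grouped.items())  # keys are unique, so only values are compared


def bulk_load(entries, fanout=4):
    """Build a B+-tree-style index bottom-up from sorted (value, row_ids) entries.

    Returns (root, height). Nodes are filled left to right, so none is split.
    """
    if not entries:
        return None, 0
    # Leaf level: consecutive chunks of at most `fanout` sorted entries.
    level = [
        {"min": entries[i][0], "entries": entries[i:i + fanout], "children": None}
        for i in range(0, len(entries), fanout)
    ]
    height = 1
    # Stack inner levels until a single root remains.
    while len(level) > 1:
        level = [
            {"min": level[i]["min"], "entries": None, "children": level[i:i + fanout]}
            for i in range(0, len(level), fanout)
        ]
        height += 1
    return level[0], height


def search(node, key):
    """Descend from the root to a leaf and return the row ids stored for key."""
    if node is None:
        return []
    while node["children"] is not None:
        # Separator keys are the minimum keys of all children but the first.
        separators = [child["min"] for child in node["children"][1:]]
        node = node["children"][bisect_right(separators, key)]
    for value, row_ids in node["entries"]:
        if value == key:
            return row_ids
    return []


def build_index(values, boundaries, fanout=4):
    """Run the simulated Map and Reduce steps and build one tree per partition."""
    partitions = map_partition(values, boundaries)
    return {pid: bulk_load(reduce_sort(pairs), fanout)
            for pid, pairs in partitions.items()}


def lookup(index, boundaries, key):
    """Pick the partition by range, then search that partition's tree."""
    pid = bisect_right(boundaries, key)
    if pid not in index:
        return []
    root, _height = index[pid]
    return search(root, key)
```

For example, `build_index([42, 7, 19, 7, 88, 63, 5, 30, 71, 19], [20, 60], fanout=2)` yields three per-partition trees, and `lookup` on that index with key `7` returns the row positions of both occurrences. In a real MapReduce job the range boundaries would typically come from sampling the column so that partitions are balanced; that step is omitted here.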