基于数据仓库的OLAP系统是当前海量多维数据分析的主要工具。随着信息技术的发展,海量多维数据的规模急剧增长,结构日益复杂,OLAP系统的性能严重下降,已经无法满足人们的数据分析需求。基于分布式计算系统Hadoop给出了新的海量多维数据的存储方法和查询方法。设计了HDFS上的列存储文件格式HCFile,基于HCFile给出了海量多维数据存储方案,该方案能够提高聚集计算效率,并有很好的可扩展性。同时,利用多维数据的层次性语义特征,设计了维层次索引,并给出了利用维层次索引和Map Reduce进行聚集计算的方法。通过和Hive的对比实验,表明了数据存储方案和查询方法能够有效提高海量多维数据分析的性能。
The OLAP(Online Analytical Processing) system built on warehouse is the most popular tool to analyze large-scale multidimensional data. With the development of information technology, data volume grows rapidly and data structure becomes more and more complicated, so the performance of OLAP system has dropped severely, failing to meet daily data analysis needs. This paper proposes new methods to store large-scale multidimensional data and perform aggregation query with Hadoop, a parallel computing system. The paper implements a new column-store format HCFile(HDFS column file), and proposals a new storage solution based on it. This project can improve the efficiency of aggregation,with a good scalability. Meanwhile, this paper leverages the hierarchy schema to build dimension hierarchy index, and uses Map Reduce to perform efficiency aggregation query. Through comparison experiments with Hive, it proves that the proposed storage solution and aggregation query can effectively improve the efficiency of large-scale multidimensional data analysis.