Apache Hadoop performs very well on extremely large data sets and offers several advantages over traditional data warehouses and relational databases for storing and processing such data. To let existing business workloads take advantage of Hadoop, SQL-on-Hadoop systems have attracted growing attention from both industry and academia. Many SQL query engines have been built on Hadoop, each with its own strengths, and their execution engines fall into three main categories: (1) traditional engines based on Map/Reduce; (2) newer engines based on Spark; and (3) MPP engines based on a shared-nothing architecture. This paper selects one representative system from each category (Hive, Spark SQL, and Impala, respectively) and uses a TPC-H-like benchmark to test and evaluate their decision-support capability in terms of query efficiency and resource usage. The experimental results show that both Impala and Spark SQL are considerably faster than Hive. On some queries Impala is more than 10 times faster than Hive, and it also uses the least cluster resources (CPU and RAM) of the three systems to complete the queries. However, when the systems are compared across stability, usability, compatibility, and performance, no single engine is best in every respect. A hybrid architecture of Hive+Impala or Hive+Spark SQL is therefore recommended when building a Hadoop-based data warehouse system.