Performance Study of Hadoop-Based SQL Query Engines

ISSN: 1000-1190
Journal: Journal of Central China Normal University (Natural Sciences) (《华中师范大学学报：自然科学版》)
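Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computer Science, Wuhan University, Wuhan 430072; [2] Intel Asia-Pacific R&D Center, Shanghai 201100
Funding: National Natural Science Foundation of China (61272112; 61472287); Key Project of the Natural Science Foundation of Hubei Province (2015CFA068)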

Authors: Wu Libing (吴黎兵)[1], Qiu Xin (邱鑫)[1,2], Ye Luyao (叶璐瑶)[1], Wang Xiaodong (王晓栋)[2], Nie Lei (聂雷)[1]

Keywords: big data, SQL-on-Hadoop, data warehouse, Spark SQL, Impala, Hive

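Chinese abstract (translated):

Apache Hadoop performs very well on extremely large data sets and has a number of advantages over traditional data warehouses and relational databases. To let existing workloads take full advantage of Hadoop, SQL-on-Hadoop systems are attracting growing attention from industry and academia. There are many Hadoop-based SQL query engines, each with its own strengths, and their execution engines fall into three main types: (1) the traditional Map/Reduce engine; (2) the emerging Spark engine; (3) MPP engines based on a shared-nothing architecture. This paper selects the three most representative SQL query engines, Hive, Spark SQL, and Impala, and uses a TPC-H-like benchmark to test and evaluate their decision-support capability. The experimental results show that both Impala and Spark SQL improve considerably on traditional Hive; some Impala queries run more than 10 times faster than on Hive, and Impala also uses the fewest cluster resources to complete its queries. However, when stability, ease of use, compatibility, and performance are compared together, no query engine is best in every respect. Therefore, when building a Hadoop-based data warehouse system, a hybrid architecture of Hive + Impala or Hive + Spark SQL is recommended.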

English abstract:

Hadoop has a significant advantage over traditional data warehouses and RDBMSs in storing and processing large amounts of data. To remain compatible with existing business logic, SQL-on-Hadoop systems are attracting growing attention from both industry and academia. There are many kinds of SQL-on-Hadoop systems, with different architectures and execution engines. These systems generally fall into three categories: traditional engines based on Map/Reduce, newer engines based on Spark, and MPP engines based on a shared-nothing architecture. In this paper, three SQL-on-Hadoop systems, Hive, Spark SQL, and Impala, are chosen to represent these categories, respectively. A TPC-H-like workload is used to benchmark the efficiency and resource usage of each system. The experimental results show that both Impala and Spark SQL are faster than Hive. On some queries, Impala is 10x faster than Hive, with the lowest CPU/RAM usage among the three systems. However, when stability, usability, compatibility, and performance are considered together, no single engine is best in every respect. Therefore, when building a Hadoop-based data warehouse system, a hybrid architecture of Hive + Impala or Hive + Spark SQL is recommended.
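
The "10x faster" figure in the abstract is a per-query speedup, that is, the baseline engine's run time divided by the other engine's run time for the same query. A minimal sketch of that calculation follows. The function name `speedups` and all timings are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
from typing import Dict


def speedups(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Return per-query speedup of `candidate` over `baseline`.

    Speedup is baseline time divided by candidate time, so a value of
    10.0 means the candidate finished the query ten times faster.
    Only queries present in both inputs are compared.
    """
    common = baseline.keys() & candidate.keys()
    return {
        query: baseline[query] / candidate[query]
        for query in sorted(common)
        if candidate[query] > 0  # skip invalid zero or negative timings
    }


# Hypothetical wall-clock times in seconds (placeholders, NOT paper data).
hive_s = {"Q1": 120.0, "Q3": 300.0, "Q6": 60.0}
impala_s = {"Q1": 10.0, "Q3": 50.0, "Q6": 4.0}

result = speedups(hive_s, impala_s)
for query, factor in result.items():
    print(f"{query}: {factor:.1f}x")
```

Reporting speedup per query rather than as a single average matches the abstract's wording that the gain applies to some queries, not to the whole workload.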
