This paper reviews related work on parallel ETL and common methods for handling workflows composed of multiple MapReduce jobs, and proposes an improved chained MapReduce framework. A parallel ETL tool is built on this framework, together with several workflow-level optimization rules for ETL processing that make an ETL flow generate fewer MapReduce jobs, thereby reducing I/O and network transfer costs. The tool was compared with Hive in large-scale experiments on real mobile Internet access data from one province; the results show that it is on average 10% to 20% faster than Hive.