Thanks to its simple programming model, MapReduce is widely used to process large-scale, high-dimensional data sets in tasks such as log analysis, document clustering, and other forms of data analytics. The open-source system Hadoop implements the MapReduce model well, but because it partitions the data only once, via a Hash/Range partition function, the Reduce side often suffers from data skew when processing skewed data. Although the system lets users supply a custom partition function, data skew is still hard to avoid when the distribution of the input data is unknown. To address this imbalance in data partitioning, this paper proposes a partitioning strategy that assigns partitions to Reducers over multiple rounds. The method first produces more fine-grained partitions on the Map side than there are Reducers, while each Mapper collects statistics on the size of every fine-grained partition as it runs; the JobTracker then uses the global partition-distribution information to select a subset of the unassigned fine-grained partitions and assigns the selected partitions to Reducers according to a cost-evaluation model; by repeating this selection and assignment over multiple rounds, all fine-grained partitions are assigned to the Reduce side before the Reduce() function is executed, so that the total amount of data received by each Reducer is balanced. Finally, the strategy is compared with Closer, an existing partition-splitting method, on Zipf-distributed and real data sets; the incremental partitioning strategy achieves better load balance after data partitioning.
As a flexible computation model, MapReduce has been widely used to process large data sets on distributed clusters for tasks such as log analysis, document clustering, and other forms of data analytics. In the open-source MapReduce platform Hadoop, the default Hash/Range partitioning scheme often leads to unbalanced data load in the Reduce phase. Although Hadoop allows users to define a custom partition function, it is difficult to achieve a balanced load without detailed knowledge of the data distribution. In this paper, we propose a multiple-round approach to balancing the data load in the Reduce phase. In our proposal, each Mapper produces more fine-grained partitions than the number of Reducers and gathers statistics on the sizes of these fine-grained partitions. The JobTracker then selects appropriate fine-grained partitions and allocates them to Reducers before the Reduce() function runs. We introduce a cost model and propose a heuristic assignment algorithm for this task. Finally, we experimentally compare our approach with Closer, which uses a segment-based partitioning method, on both synthetic (Zipf-distributed) and real data sets. The experimental results show that our method achieves a more balanced data load.
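To make the balancing idea behind the multi-round assignment concrete, the following is a minimal Java sketch, not the paper's actual cost model or JobTracker code: it simply assigns the largest unassigned fine-grained partition to the currently least-loaded Reducer (a classic largest-first greedy). All class and method names (GreedyPartitionAssigner, Partition, assign) are hypothetical and for illustration only.

import java.util.*;

public class GreedyPartitionAssigner {

    // A fine-grained partition with an id and the byte count reported by Mappers (hypothetical).
    record Partition(int id, long sizeBytes) {}

    // Returns reducerId -> list of partition ids, aiming to balance total bytes per Reducer.
    static Map<Integer, List<Integer>> assign(List<Partition> partitions, int numReducers) {
        // Min-heap of {currentLoad, reducerId}: poll the least-loaded Reducer first.
        PriorityQueue<long[]> reducers = new PriorityQueue<>(Comparator.comparingLong(r -> r[0]));
        Map<Integer, List<Integer>> result = new HashMap<>();
        for (int r = 0; r < numReducers; r++) {
            reducers.add(new long[]{0L, r});
            result.put(r, new ArrayList<>());
        }

        // Largest partitions first, each going to the Reducer with the smallest current load.
        partitions.stream()
                .sorted(Comparator.comparingLong(Partition::sizeBytes).reversed())
                .forEach(p -> {
                    long[] least = reducers.poll();
                    result.get((int) least[1]).add(p.id());
                    least[0] += p.sizeBytes();
                    reducers.add(least);
                });
        return result;
    }

    public static void main(String[] args) {
        List<Partition> parts = List.of(
                new Partition(0, 900), new Partition(1, 500), new Partition(2, 400),
                new Partition(3, 300), new Partition(4, 200), new Partition(5, 100));
        // With 2 Reducers, the greedy assignment gives each about 1200 bytes of the 2400 total.
        System.out.println(assign(parts, 2));
    }
}

In the scheme described above, this kind of decision would not be made in a single pass: the JobTracker would select and assign only part of the fine-grained partitions in each round, using size statistics reported by the still-running Mappers, which the sketch deliberately omits.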