MapReduce是目前广泛应用的并行计算框架,是Hadoop平台的重要组成部分。主要包括Map和Reduce函数,Map函数输出key-value键值对作为Reduce的输入。由于输入的动态性,不同主机上的Reduce处理的输入量存在不均衡性。如何解决Reduce的负载均衡是优化MapReduce的一个重要研究方向。对整体数据进行抽样,通过适量的样本分析数据,达到较小的代价获得可靠的key分布,提出贪心算法代替Hadoop平台默认的Hash算法来划分数据,实现Reduce负载均衡。提出的贪心算法主要思想是根据抽样数据,求取所有key频次的和对于Reduce节点数量的平均值,然后依次为每一个Reduce分配一个接近平均值的负载,从而达到整体的负载均衡。模拟实验表明,所提算法与默认的hash分区算法相比,运行时间节约10.6%,达到更好的负载均衡。
MapReduce is used wildly as a parallel computing framework, mainly including the Map function and Reduce function.Map function has the output of the key-value pairs, which are the input of the Reduce function. As a result, the input of Reduce is dynamic. The load balancing of Reduce hosts has an important impact on the efficiency of MapReduce. This paper sampled the overall data. The aim was to obtain reliable key distribution at a cheap price. Then, it proposed a greedy algorithm to divide data to achieve Reduce load balancing, taking the place of hash algorithm. The main idea of the proposed greedy algorithm was to assign the right job to a Reduce host for the best load balancing in each step. Simulation results show that proposed algorithm has better performance than the other two algorithms. Compared with the default Hash partitioning algorithm,the proposed algorithm decreases the running CPU time by 10. 6% , and achieves better load balancing.