随着移动互联网和物联网的飞速发展,数据规模呈爆炸性增长态势,人们已经进入大数据时代。MapReduce是一种分布式计算框架,具备海量数据处理的能力,已成为大数据领域研究的热点。但是MapReduce的性能严重依赖于数据的分布,当数据存在倾斜时,MapReduce默认的Hash划分无法保证Reduce阶段节点负载平衡,负载重的节点会影响作业的最终完成时间。为解决这一问题,利用了抽样的方法。在用户作业执行前运行一个MapReduce作业进行并行抽样,抽样获得key的频次分布后结合数据本地性实现负载均衡的数据分配策略。搭建了实验平台,在实验平台上测试WordCount实例。实验结果表明,采用抽样方法实现的数据划分策略性能要优于MapReduce默认的哈希划分方法,结合了数据本地性的抽样划分方法的效果要优于没有考虑数据本地性的抽样划分方法。
With the rapid development of mobile Intemet and the Internet of Things, the data size explosively grows, and people have been in the era of big data. As a distributed computing framework, MapReduce has the ability of processing massive data and becomes a focus in big data. But the performance of MapReduce depends on the distribution of data. The Hash partition function defaulted by MapReduce can' t guarantee load balancing when data is skewed. The time of job is affected by the node which has more data to process. In order to solve the problem, sampling is used. It does a MapReduce job to sample before dealing with user' s job in this paper. After learning the distribution of key,load balance of data partition is achieved using data locality. The example of WordCount is tested in experimental plat- form. Results show that data partition using sample is better than Hash partition, and taking data locality is much better than that using sample but no data locality.