The high energy consumption of big data processing urgently needs to be addressed, especially as data volumes grow rapidly. This paper first analyzes the shortcomings of existing data layout strategies: they are ill-suited to energy-saving modes based on storage-zone partitioning and to heterogeneous HDFS clusters, their data block splitting algorithms are inflexible, and they select storage nodes at random. It then proposes an energy-saving-oriented data layout strategy for MapReduce. First, the new strategy accommodates an energy-saving mode that partitions the cluster into two storage zones, Active-Zone and Sleep-Zone. Second, it improves on the traditional method of computing the number of data blocks by determining that number from the minimum number of tasks required under a job deadline constraint. Finally, it adapts better to heterogeneous cluster environments and selects storage nodes according to job type. Experimental results show that the new data layout strategy adapts to heterogeneous cluster environments and reduces the energy consumption of MapReduce jobs.
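The idea of deriving the number of data blocks from a minimum task count under a job deadline can be illustrated with a minimal sketch. This is not the paper's formula, which the abstract does not give. The sketch assumes a deliberately simple model: all map tasks run in a single parallel wave, every task processes data at the same rate, and each task pays a fixed startup cost. The function names and parameters (`min_tasks_for_deadline`, `block_size_for`, `bytes_per_sec`, `startup_sec`) are hypothetical.

```python
import math


def min_tasks_for_deadline(input_bytes, bytes_per_sec, deadline_sec, startup_sec=0.0):
    """Smallest number of equal-sized parallel tasks that can finish by the deadline.

    Assumed model (not from the paper): one wave of tasks, identical per-task
    throughput, and a fixed per-task startup cost.
    """
    usable_sec = deadline_sec - startup_sec
    if usable_sec <= 0:
        raise ValueError("deadline must exceed the per-task startup cost")
    # Each task must process input_bytes / n bytes within usable_sec seconds,
    # so n >= input_bytes / (bytes_per_sec * usable_sec).
    return max(1, math.ceil(input_bytes / (bytes_per_sec * usable_sec)))


def block_size_for(input_bytes, n_tasks):
    """Block size (in bytes) that yields one block per task, rounded up."""
    return -(-input_bytes // n_tasks)  # integer ceiling division


if __name__ == "__main__":
    n = min_tasks_for_deadline(input_bytes=1000, bytes_per_sec=10, deadline_sec=24)
    print(n, block_size_for(1000, n))
```

Under this model a tighter deadline raises the task count and therefore shrinks the block size, while a looser deadline allows fewer, larger blocks. A heterogeneous cluster would need per-node throughput rather than a single rate.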