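The energy consumed by a cluster already costs more than the hardware itself, and big data processing keeps large clusters busy for long periods. Energy-efficient big data processing is therefore an urgent problem for data owners and users, and a major challenge for energy supply and the environment. Existing work generally either powers down some of the nodes to reduce energy consumption or designs new data storage strategies that enable energy-efficient processing. Our analysis shows that a great deal of energy is wasted even when the minimum number of nodes is used, and that a new storage strategy forces large-scale data migration on an already deployed cluster, which consumes additional energy. For I/O-intensive big data tasks on heterogeneous clusters, we propose a new energy-efficient algorithm, MinBalance, which splits the problem into two steps: node selection and load balancing. The node selection step applies four different greedy strategies that take node heterogeneity into account so as to choose the most suitable nodes for the task; the load balancing step balances the load across the selected nodes to reduce the energy wasted while nodes wait. The method is general and is not affected by the data storage strategy. Experiments show that, on large data sets, MinBalance reduces energy consumption by more than 60% compared with the traditional approach of powering down some nodes.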
It is reported that the electricity cost of operating a cluster may well exceed its acquisition cost, and big data processing requires large-scale clusters and long running times. Energy-efficient processing of big data is therefore essential for data owners and users, and it is also a great challenge for energy use and environmental protection. Existing methods either power down some nodes to reduce energy consumption or develop new data storage strategies for the cluster. However, we find that much energy is still wasted even when the minimum number of nodes is used to process a task, and that new storage strategies do not suit already deployed clusters because of the extra cost of data migration. In this paper, we propose a novel algorithm, MinBalance, to process I/O-intensive big data tasks energy-efficiently in heterogeneous clusters. The algorithm is divided into two steps: node selection and workload balancing. In the former step, four greedy policies are used to select suitable nodes, taking the heterogeneity of the cluster into account. In the latter step, the workloads of the selected nodes are balanced to avoid the energy wasted by waiting. MinBalance is a general algorithm and is not affected by the data storage strategy. Experimental results indicate that MinBalance can achieve over 60% energy reduction on large data sets compared with the traditional method of powering down part of the nodes.
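The two-step structure described above can be illustrated with a minimal sketch. It is not the MinBalance implementation: the abstract does not specify the four greedy policies or the energy model, so the selection criterion used here (lowest power per unit of I/O throughput), the `Node` fields, and the assumption that every selected node stays powered on until the slowest one finishes are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node description; both figures are assumed inputs."""
    name: str
    throughput_mb_s: float  # assumed sustained I/O throughput
    power_w: float          # assumed power draw while the node is on


def select_nodes(nodes, k):
    """Greedy selection under ONE assumed policy: keep the k nodes
    with the lowest power per unit of throughput (joules per MB)."""
    return sorted(nodes, key=lambda n: n.power_w / n.throughput_mb_s)[:k]


def even_split(nodes, data_mb):
    """Baseline: every node receives the same amount of data."""
    return {n.name: data_mb / len(nodes) for n in nodes}


def balanced_split(nodes, data_mb):
    """Split data in proportion to throughput so all nodes finish together."""
    total = sum(n.throughput_mb_s for n in nodes)
    return {n.name: data_mb * n.throughput_mb_s / total for n in nodes}


def energy_joules(nodes, shares):
    """Energy under the assumed model: all nodes stay powered on
    until the slowest node has finished its share."""
    makespan = max(shares[n.name] / n.throughput_mb_s for n in nodes)
    return makespan * sum(n.power_w for n in nodes)


if __name__ == "__main__":
    cluster = [
        Node("a", throughput_mb_s=100.0, power_w=200.0),
        Node("b", throughput_mb_s=50.0, power_w=150.0),
        Node("c", throughput_mb_s=25.0, power_w=140.0),
    ]
    chosen = select_nodes(cluster, k=2)
    for label, split in (("even", even_split), ("balanced", balanced_split)):
        shares = split(chosen, 3000.0)
        print(label, shares, energy_joules(chosen, shares))
```

In this toy model the balanced split uses less energy than the even split because the faster node no longer idles while the slower one finishes, which is the kind of waiting waste the load balancing step targets.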