广泛用于数据密集型计算的MapReduce模型将计算部署到数据端并行执行,数据布局将不再只影响存储本身,还影响计算效率;节点上存储数据的特征决定该节点上任务的执行效率,负载均衡从传统的服务器管理或任务调度研究转变成为以提高并行性为目的的数据布局研究,为此,分析了数据密集型计算和MapReduce环境中数据布局的特点,提出了负载均衡的数据布局目标,并提出在特定环境下实现负载均衡的数据布局方法,最后通过实验证明了数据布局目标和数据布局方法的有效性.理论和实验结果证明,新提出的布局方法能有效地提高MapReduce应用的并行性,优化其执行效率.
In the MapReduce model, widely used for data-intensive computing, computation is moved to the nodes that hold the data and executed there in parallel. Data layout therefore affects not only storage itself but also computing efficiency: the characteristics of the data stored on a node determine how efficiently tasks run on that node. As a result, load balancing shifts from the traditional study of server management or task scheduling to the study of data layout aimed at improving parallelism. This paper analyzes the characteristics of data layout in data-intensive computing and MapReduce environments, proposes a load-balancing goal for data layout, and presents a data layout approach that achieves load balancing in a specific environment. Experiments demonstrate the effectiveness of the proposed goal and approach. Theoretical and experimental results show that the proposed layout approach can effectively improve the parallelism of MapReduce applications and thereby optimize their execution efficiency.