单机运行环境难以满足基于海量数据的大图算法对时空开销的需求,如何设计高效的面向云计算环境的分布式大图算法越来越受到人们的关注,MapReduce作为云计算的核心计算模式受限于易并行(EP)计算模型的制约不易表达图算法.文中突破了MapReduce基于易并行计算的假设,增强了MapReduce既有的编程规范,新的大同步(BSP)计算模型既能保证兼容旧的MapReduce作业可以无改动的运行,同时引入消息传递机制允许变化的状态数据在并行任务的超级步间进行交互.系统提供高度灵活的消息自定义接口,针对不同应用需求设计了轻量级和重量级两种自适应的消息传递机制,更高效地支持有数据交互需求的包含迭代处理的一大类图算法.在真实大规模图数据集上的实验结果表明,相比于原始的MapReduce作业外部链式处理,该文提出的BSP模型下的内部超级步迭代计算模式大幅降低了大图算法的处理时间.
Since analyzing large-scale graph is usually difficult to be implemented on a single machine,how to design efficient parallel large-scale graph algorithms is receiving more and more attention.Constrained by embarrassingly parallel assumption,parallel graph algorithms are not easy to express in MapReduce.Inspired by Bulk Synchronous Parallel model,we propose a message-enhanced version of Hadoop MapReduce that breaks its key assumption.Enhanced implementation is compatible with original Hadoop MapReduce,existing Hadoop MapReduce programs can run directly on this platform without modification,and uses message passing mechanisms to facilitate interactive data communication between supersteps of tasks.It also provides a highly flexible self-defined message passing interface and two adaptive message passing mechanisms to support efficient implementation of graph algorithms with data transition and iterative computation.The experimental results on the real Stanford large network dataset collection demonstrate the superiority of enhanced version over original Hadoop MapReduce on PageRank algorithm.