MapReduce is one of the most popular parallel systems for big-data analysis. Many companies have built their own MapReduce clusters to provide computing services to users. Users can submit MapReduce jobs with deadline constraints to a cluster, and the company earns a benefit for each job that completes before its deadline. For this application scenario, this paper is the first to pose the maximum benefit problem in a MapReduce cluster. To solve the problem effectively, we first propose a sequence-based task scheduling strategy (the SEQ strategy for short) and prove that it has advantages when processing deadline-constrained jobs. Building on the SEQ strategy, we propose a Scheduling Algorithm for Maximum Benefit (AMB). AMB quickly determines which jobs can be accepted and produces an effective execution plan for them, with the goal of maximizing the benefit. In addition, to handle exceptions that arise in practice (e.g., node failure), we design a timeout-handling strategy that further improves the practicality of the algorithm. Finally, extensive experiments verify the effectiveness of the proposed algorithm.
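To make the maximum benefit problem concrete, the following is a minimal toy sketch and not the SEQ strategy or the AMB algorithm, whose details are not given in this abstract. It assumes a heavily simplified model in which each job occupies the whole cluster for a known duration and jobs run one after another; the `Job` fields and the functions `feasible` and `max_benefit` are hypothetical names introduced only for illustration. The sketch enumerates every subset of jobs, keeps those that can all meet their deadlines, and returns the subset with the largest total benefit.

```python
from itertools import combinations
from typing import NamedTuple, Sequence, Tuple


class Job(NamedTuple):
    """Hypothetical job model: not taken from the paper."""
    name: str
    duration: int   # time the job occupies the (single, shared) cluster
    deadline: int   # latest allowed completion time
    benefit: int    # benefit earned only if the job finishes by its deadline


def feasible(jobs: Sequence[Job]) -> bool:
    """Check whether all jobs can meet their deadlines when run back to back.

    On a single sequential resource, running jobs in earliest-deadline-first
    order meets every deadline whenever any order does.
    """
    elapsed = 0
    for job in sorted(jobs, key=lambda j: j.deadline):
        elapsed += job.duration
        if elapsed > job.deadline:
            return False
    return True


def max_benefit(jobs: Sequence[Job]) -> Tuple[int, Tuple[Job, ...]]:
    """Brute-force search for the accepted job set with the largest benefit.

    Exponential in the number of jobs; intended only to illustrate the
    problem statement on tiny inputs.
    """
    best_total, best_subset = 0, ()
    for size in range(1, len(jobs) + 1):
        for subset in combinations(jobs, size):
            if feasible(subset):
                total = sum(job.benefit for job in subset)
                if total > best_total:
                    best_total, best_subset = total, subset
    return best_total, best_subset
```

The exhaustive search is practical only for a handful of jobs, which is one way to see why a fast acceptance and scheduling algorithm is needed for a real cluster.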