在数据中心中,类MapReduce的分布式计算系统在数据的混洗阶段产生巨大流量,令数据中心的东西向网络资源成为瓶颈.将这些高度相关的数据流在接收端进行聚合是分布式计算的通用处理方式,为了降低网络通信量并有效利用带宽,文中采用网内关联性流量的汇聚传输策略,将混洗和汇聚并行化,达到进一步降低东西向网络资源消耗、缩短混洗阶段延迟的目的.目前提出的IRS-based算法在适用场景上有一定局限性,为了解决这一问题,文中首先在以服务器为中心的代表结构BCube上建立incast最小树模型,分别提出MIB-based算法和MC-based算法,仅根据已知拓扑结构和发送节点编号即可快速生成一棵近似的最小代价incast树.MIB-based算法针对发送节点强关联的情况,使高层发送节点尽可能汇聚到已有的低层发送节点构建incast树;MC-based算法针对发送节点松散关联的情况,将节点进行最大程度上的聚合,通过增加最少的汇聚点完成incast树的构建.随后将上述两种算法结合起来进一步提出适用于各种场景的M2-based算法,通过推算时间复杂度证明该算法能够满足在线构建incast树的需求.最后,详细分析了M2-based算法对其他数据中心网络结构的适应性以及网内汇聚传输能够减少作业完成时间的原理.小规模实验结果表明,在不同网络规模下,M2-based比IRS-based节省了网络中约3%的数据量,整个作业在混洗和Reduce阶段的等待时间比不采用网内汇聚缩短约2/3;在不同传输节点规模下,M2-based比IRS-based节省了网络中约19%的数据量,整个作业在混洗和Reduce阶段的等待时间比不采用网内汇聚缩短约3/4.
In data centers, distributed computing systems like MapReduce produce massive amount of traffic across successive processing stages. Such shuffle transfers make east-west network resource become a bottleneck. In many commonly used workloads, data flows from all senders to each receiver are typically highly correlated. Many state-of-the-practice systems thus already apply aggregation functions at the receiver side of a shuffle transfer to reduce the output data size. To lower down the network traffic and efficiently use network bandwidth, we introduce in-network aggregation for associated traffic and parallelize the shuffle and reduce phases. It can significantly reduce consuming the rare east-west network resource, and avoid long latency time produced by the shuffle phase in MapReduce jobs. IRS-based algorithm proposed currently has certain limitations. To solve this problem, we first built a model for incast minimal tree with BCube, a representative server-centric networking structure for future data centers, and propose two approximate incast tree construction methods named MIB-based and MC-based, respectively, solely based on the labels senders and the data center topology. MIB-based method is applied to the case of highly correlative senders. It can build an incast minimal tree by making an endeavor to aggregate the high-level senders to low-level senders. MC-based method is applied to the case of loose associative senders. It can build an incast minimal tree by aggregating nodes as far as possible and increasing the least nodes. Then we combined two methods and further proposed M2-based method for any case. It proved that the method we proposed can meet the demand of building the incast tree on line by calculating the time complexity of the M2-based incast tree building method. At last, we analyzed the adaptability of M2-based to other data center structures, and the principle of in-network aggregation in reducing the job execution time. The small-scale experimental results show that, in the different size o