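Stream processing systems, whether centralized or distributed, must overcome single-node bottlenecks. Moreover, a statically configured stream processing system can end up with either too few or too many processing nodes. To address this, this paper proposes a scalable parallel-distributed stream processing system, 流水行云 (SPSPS). The system partitions the query topology into sub-queries that are processed in parallel according to its stateful operators, preserves the order of the data stream through each stateful operator's distributor and collector, and minimizes the communication overhead of parallel processing. In addition, scalability techniques that combine load balancing and reconfiguration let the system dynamically adjust both the load on its processing nodes and their number according to the input load. Experiments on a cluster of 60 nodes demonstrate the system's scalability.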
Stream processing systems, whether centralized or distributed, have to overcome the single-node bottleneck. Moreover, static configurations leave them with either a shortage or a surplus of resources. To this end, this paper proposes a scalable parallel-distributed stream processing system named SPSPS. The system splits a query into parallel sub-queries according to its stateful query operators so as to minimize the communication overhead of parallel processing, and achieves order-preserving tuple processing through each stateful operator's distributor and collector. In addition, scalability techniques that combine load balancing and reconfiguration allow resources to be adjusted effectively in response to the incoming load. Experiments on a cluster of 60 nodes demonstrate the system's scalability.
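The distributor and collector roles can be illustrated with a minimal sketch. It is not the SPSPS implementation: the class names `Distributor` and `Collector`, the sequence-number tagging, and the hash-by-key routing are all assumptions chosen for illustration. The idea shown is one common way to keep order: the distributor stamps each tuple with a sequence number before fanning it out, and the collector buffers results until the next expected sequence number arrives.

```python
import heapq
from typing import Any, List, Tuple

Tagged = Tuple[int, Any, Any]  # (sequence number, key, value)


class Distributor:
    """Hypothetical distributor: tags tuples and routes them by key hash."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.next_seq = 0

    def route(self, key: Any, value: Any) -> Tuple[int, Tagged]:
        seq = self.next_seq
        self.next_seq += 1
        # Assumption: tuples with the same key go to the same worker,
        # so per-key operator state stays on a single node.
        worker = hash(key) % self.num_workers
        return worker, (seq, key, value)


class Collector:
    """Hypothetical collector: releases results in sequence order."""

    def __init__(self) -> None:
        self.pending: List[Tagged] = []  # min-heap ordered by sequence number
        self.next_seq = 0

    def accept(self, seq: int, key: Any, result: Any) -> List[Tagged]:
        heapq.heappush(self.pending, (seq, key, result))
        released: List[Tagged] = []
        # Emit only the contiguous run starting at the next expected number.
        while self.pending and self.pending[0][0] == self.next_seq:
            released.append(heapq.heappop(self.pending))
            self.next_seq += 1
        return released


def run_demo() -> List[int]:
    """Fan six tuples out to three workers and merge them back in order."""
    distributor = Distributor(num_workers=3)
    queues: List[List[Tagged]] = [[] for _ in range(3)]
    for i, key in enumerate(["a", "b", "c", "a", "b", "c"]):
        worker, tagged = distributor.route(key, i)
        queues[worker].append(tagged)

    collector = Collector()
    output: List[Tagged] = []
    # Drain the workers in reverse to simulate results arriving out of order.
    for queue in reversed(queues):
        for seq, key, value in queue:
            output.extend(collector.accept(seq, key, value * 10))
    return [seq for seq, _, _ in output]
```

In this sketch the collector's buffer grows with the skew between workers, so a real system would also need to bound that buffer or balance load across workers, which is where the load balancing described above would come in.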