The demand for online analysis of continuous data streams keeps growing, and data stream management systems have emerged to process massive, volatile data in real time. In the era of big data, with the development of open processing platforms, several distributed stream processing systems, such as S4, Storm, and Spark Streaming, have appeared to handle large-scale and diverse data streams. However, to improve the usability and processing capability of these systems, relational query systems with abstract query languages need to be built on top of them, so as to form complete distributed data stream management systems. How to design and implement an efficient, easy-to-use relational query system remains a pressing open problem. In this survey, we first give an overview of the typical applications, data characteristics, and design goals of distributed data stream query processing. We then propose a basic architecture for distributed data stream relational query systems and, based on it, analyze in depth the key techniques: user-defined function (UDF) queries, query optimization, driving modes, compilation techniques, operator management, scheduling management, and parallelism management. Next, we compare four representative query systems: SPL, StreamingSQL, Squall, and DBToaster. Finally, we point out the challenges the field faces and the directions for future work in optimization techniques, execution strategies, real-time precise queries, and complex query analysis.