Cardinality estimation has important applications in stream-data query optimization, network security, data compression, and related fields. Existing probabilistic cardinality estimation algorithms must scan static historical data to produce an estimate. Because a data stream is continuous, fast, and real-time, it cannot be persisted first and analyzed afterwards, so these traditional algorithms cannot be applied directly to big-data stream processing. Based on a study of the real-time distributed stream-processing mechanisms of Spark and Storm and of traditional cardinality estimation algorithms, a real-time cardinality estimation algorithm for stream data, SHELL (Streaming HypErLogLog), is designed and implemented. Experimental results show that, without loss of accuracy, SHELL processes 6.0 × 10^5 to 6.8 × 10^5 messages per sliding time window, which satisfies real-time processing requirements.
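The abstract does not give the internals of SHELL, so the following is only a minimal sketch of the general idea it names: a standard HyperLogLog sketch maintained per micro-batch, with the sketches of the most recent batches merged to estimate the cardinality of a sliding window. It is not the authors' implementation. The class names `HyperLogLog` and `SlidingWindowCardinality`, the parameter `slices_per_window`, the precision `p=12`, and the use of SHA-1 as the hash function are all assumptions made for illustration.

```python
import hashlib
import math
from collections import deque


class HyperLogLog:
    """Plain HyperLogLog with m = 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to 64 bits (SHA-1 is an illustrative choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counted from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


class SlidingWindowCardinality:
    """Keeps one sketch per micro-batch and merges the latest ones on demand."""

    def __init__(self, slices_per_window, p=12):
        self.p = p
        # The deque drops the oldest batch sketch once the window is full.
        self.slices = deque(maxlen=slices_per_window)

    def push_batch(self, items):
        sketch = HyperLogLog(self.p)
        for item in items:
            sketch.add(item)
        self.slices.append(sketch)

    def estimate(self):
        merged = HyperLogLog(self.p)
        for sketch in self.slices:
            merged.merge(sketch)
        return merged.estimate()
```

Because merging is a register-wise maximum, no raw messages need to be stored: each batch is reduced to a fixed-size sketch as it arrives, and the window estimate is recomputed from at most `slices_per_window` sketches. With `p=12` the expected relative standard error is roughly 1.04/√4096, about 1.6%.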