Data stream applications have proliferated in recent years, and data mining algorithms built on the data stream model have become an active research topic. This paper presents SHStream, an algorithm based on the Hoeffding bound for discovering and maintaining subspace clusters over high-dimensional data streams. SHStream partitions the stream into segments whose length is determined by the Hoeffding bound, performs subspace clustering on each segment, and refines the clusters iteratively until the result meets the required clustering accuracy. To account for the dynamic nature of data streams, SHStream also adjusts and maintains the clustering results. Using a grid- and density-based technique, SHStream handles high-dimensional data streams effectively and discovers clusters of arbitrary shape. Experiments on real and synthetic datasets show that the algorithm is applicable and effective.
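The abstract does not state how the segment length is derived, so the following is only a minimal sketch of one plausible reading. It assumes the standard Hoeffding bound, under which the mean of n independent observations with range R deviates from its expectation by more than ε = sqrt(R² ln(1/δ) / (2n)) with probability at most δ, and solves that expression for n. The names `hoeffding_segment_length`, `segments`, `value_range`, `epsilon`, and `delta` are illustrative and are not taken from SHStream.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def hoeffding_segment_length(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n with sqrt(R^2 * ln(1/delta) / (2n)) <= epsilon.

    Assumption: the segment length follows from the standard Hoeffding
    bound; SHStream's actual derivation may differ.
    """
    if value_range <= 0 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need value_range > 0, epsilon > 0 and 0 < delta < 1")
    n = value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)


def segments(stream: Iterable[T], length: int) -> Iterator[List[T]]:
    """Yield consecutive segments of at most `length` items from the stream."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, length))
        if not chunk:
            return
        yield chunk
```

With `value_range=1.0`, `epsilon=0.1`, and `delta=0.05`, the function returns a segment length of 150. Each segment yielded by `segments` would then be passed to a subspace clustering step, which is not sketched here.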