To address two weaknesses of current stream clustering algorithms, namely their inability to detect and handle outliers in a data stream effectively and the low efficiency of incremental stream clustering, an enhanced affinity propagation stream clustering algorithm with density-based outlier detection and removal is proposed. Building on the affinity propagation stream clustering algorithm (STRAP), the proposed algorithm introduces an outlier detection and removal mechanism that reduces the negative effect of outliers on clustering accuracy and efficiency. Affinity propagation clustering is used to cluster the online data stream, while the stream is monitored for data drift, i.e., a change in the distribution of the data stream over time. When drift is detected, the density-based local outlier factor (LOF) technique is applied to detect and remove outliers in the reservoir, and the current clusters are then re-clustered together with the filtered reservoir to rebuild the dynamic data stream model. Experiments on real network data (KDD'99) show that the proposed algorithm not only reduces the number of re-clustering operations needed to rebuild the dynamic model, and thereby improves clustering efficiency, but also outperforms the traditional STRAP algorithm on all three clustering evaluation criteria considered: accuracy, purity, and entropy.
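The reservoir filtering step can be illustrated with a minimal pure-Python sketch of the local outlier factor. This is not the authors' implementation: the function names `lof_scores` and `filter_reservoir`, the neighbourhood size `k`, and the cutoff `threshold=1.5` are hypothetical choices made for illustration. The sketch also assumes Euclidean distance, uses exactly `k` neighbours per point (ignoring ties at the k-distance), and leaves out the affinity propagation and drift detection steps.

```python
import math


def lof_scores(points, k=3):
    """Return a local outlier factor for each point (simplified LOF sketch)."""
    n = len(points)
    # k nearest neighbours of every point as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_dist[j], d(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: mean ratio of the neighbours' densities to the point's own density.
    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):
            # Point coincides with its neighbours; treat it as an inlier.
            scores.append(1.0)
            continue
        ratios = [lrd[j] / lrd[i] for _, j in neighbours[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores


def filter_reservoir(reservoir, k=3, threshold=1.5):
    """Split the reservoir into (kept, removed) by an assumed LOF cutoff."""
    scores = lof_scores(reservoir, k)
    kept = [p for p, s in zip(reservoir, scores) if s <= threshold]
    removed = [p for p, s in zip(reservoir, scores) if s > threshold]
    return kept, removed
```

In a pipeline of the kind described above, something like `filter_reservoir` would be called on the reservoir once drift is detected, and only the kept points would be passed on to re-clustering together with the current clusters. A score close to 1 means a point is about as dense as its neighbours, while a score well above 1 marks a point in a much sparser region.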