随着计算机技术及感知技术的发展及应用,各个领域普遍出现不确定性数据流形态的新型数据,吸引了众多研究者的关注.现有的数据流聚类技术普遍忽略不确定性特征,常导致聚类结果的不合理甚至不可用.为数不多的针对不确定性特征的聚类方法片面考察不确定性,且大多基于K-Means算法,具有先天缺陷.针对这一问题展开研究,提出了不确定度模型下数据流自适应网格密度聚类算法(adaptive density-based clustering algorithm over uncertain data stream,ADC-UStream).对于不确定性特征,该算法在存在级和属性级不确定性统一策略下,构建熵不确定度模型进行不确定性度量,综合考察不确定性.采用网格密度的聚类算法,基于衰减窗口模型设计时态和空间的自适应密度阈值,以适应不确定性数据流的时态性和非均匀分布特征.实验结果表明,不确定模型下的数据流网格密度自适应聚类算法ADC-UStream在聚类结果质量和聚类效率方面都具有较好的性能.
Uncertain data stream, a new widespread data form which is emerging in many application fields with the development of computer and sensing technology. The research of data analysis and processing of uncertain data stream has attracted the attention of many researchers. Existing data stream clustering techniques generally ignored uncertainty characteristics. It often makes the clustering results unreasonable even unavailable. The two aspects of uncertain character, existence- uncertainty and attributive-uncertainty, can affect the clustering process and results significantly. But they can't be considered at same time in existing relevant work. The lately reported clustering algorithms are all based on K-Means algorithm with inherent shortage. In order to solve this problem, a data stream adaptive grid-density based algorithm, ADC-UStream, is proposed under the uncertainty of mode[. For the uncertainty characteristic, with the unified strategy of the presence and properties uncertainty, the algorithm builds the entropy uncertainty model to measure the uncertainty. With the comprehensive survey of uncertainty, the grid-density based clustering algorithm over attenuation window model is adopted to design the temporal and spatial adaptive density threshold, to adapt to the temporal and non-uniform distribution characteristics of the uncertainty data flow. The experimental results show that the ADC-UStream algorithm under the uncertainty model has good performance both in clustering quality and clustering efficiency.