Clustering large, high-dimensional data sets has become a focus of current clustering research and remains a serious challenge for clustering algorithms. Because of the high dimensionality and the inherently sparse feature space of most real-world data sets, clusters are often hidden in subspaces of the original feature space, and traditional clustering algorithms fail to detect meaningful clusters. In addition, high-dimensional data often contain a significant amount of random noise, which causes additional efficiency problems. To overcome these problems, this paper proposes OpCluster, a new clustering algorithm that builds on CLIQUE and is based on optimal interval partitioning and data set partitioning. Experiments on synthetic data demonstrate the effectiveness and efficiency of the new approach on large, high-dimensional data sets.