ROCK是Sudipno Guha等1999年提出的一个著名的面向分类属性数据的聚类算法,其突出贡献是采用公共近邻(链接)数的全局信息作为评价数据点间相关性的度量标准,而不是传统的基于两点间距离的局部度量函数.尽管ROCK在Mushroom等分类属性数据集上取得了很好的聚类结果,但该算法本身也存在一些缺陷和不足.首先,衡量两个数据点是否为邻居的相似度阈值θ需要预先静态指定,该阈值对聚类质量影响很大,在对数据集没有充分了解的前提下给出恰当的阈值是困难的.其次,在ROCK算法中,相似度函数sim仅被用于最初邻居的判断上,只考虑相似与否,而未考虑相似程度,使算法对θ值过于敏感.另外,ROCK还要求用户事先选定聚类簇数k.这些缺陷或者影响聚类效果,或使算法不便使用.该文深入分析了上述问题,并提出基于动态近邻选择模型的聚类算法DNNS,通过优选近邻来提高聚类质量.文中还定义了内聚度度量函数以指导聚类过程.对标准数据集VOTE和ZOO的实验结果表明,DNNS算法的fα指标优于ROCK和VBACC.
ROCK, proposed by Sudipno Guha et al in 1999, is a well known, robust, categorical attribute oriented clustering algorithm. The main contribution of ROCK is the introduction of a novel concept called "common neighbors" (links) as similarity measure between a pair of data points. Compared with traditional distance-based approaches, links capture global information over the whole data set rather than local information between two data points. Despite its success in clustering some categorical databases such as Mushroom, there are still some underlying weaknesses. First, the user is required to select a similarity threshold θ, a value that can significantly influence final clustering results. Without sufficient prior-knowledge, it is difficult to make a proper choice of value θ. Second, similarity function sire is only used to judge neighbors and the degree of similarity is lost during the iterative process of clustering, making the algorithm sensitive to the value of θ. Third, the number of desired final clusters must be pre-specified, which is also difficult without fully understanding of the data set. These shortcomings either hinder the algorithm from achieving even better clustering result, or make the algorithm inconvenient to use. This paper investigates the above problems and proposes a novel algorithm named DNNS using Dynamic Nearest Neighbors Selection model, which improves clustering quality with an appropriate selection of nearest neighbors. A new cohesion measure also is discussed to control the clustering process. Experimental results on standard databases VOTE and ZOO demonstrate that DNNS outperforms ROCK and VBACC based on the evaluation metrics of fα.