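Traditional clustering algorithms are usually built on explicit models and rarely consider generalizing the model to fit different data, which leads to the model mismatch problem. To address this problem, this paper proposes a clustering framework based on space mapping and rescaling (the M-R framework for short). Specifically, the M-R framework first maps the corpus into a coordinate system built from a set of highly discriminative directions in order to gather the distribution statistics of each cluster, and then rescales each coordinate axis according to these statistics so as to normalize the distribution of every cluster in the corpus. These two steps are executed iteratively along with the clustering algorithm until it converges. To verify its performance, the M-R framework is applied to the k-means algorithm and to spectral clustering; experiments on international standard evaluation corpora show that k-means and spectral clustering with the M-R framework achieve performance improvements on all datasets.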
Traditional clustering algorithms suffer from the model mismatch problem when the distribution of real data does not fit the model assumptions. To address this problem, a mapping and rescaling framework (referred to as the M-R framework) is proposed for document clustering. Specifically, documents are first mapped into a discriminative coordinate system so that the distribution statistics of each cluster can be analyzed on the corresponding dimension. With the statistics obtained, a rescaling operation is then applied to normalize the data distribution according to the model assumptions. These two steps are conducted iteratively along with the clustering algorithm to improve the clustering performance. In the experiments, the M-R framework is applied to traditional k-means and to the state-of-the-art spectral clustering algorithm Ncut. Results on well-known datasets show that the M-R framework brings performance improvements on all datasets.
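The iterative map-then-rescale loop described above can be illustrated with a minimal sketch wrapped around k-means. The abstract does not state how the discriminative directions are chosen or which statistics drive the rescaling, so everything specific here is an assumption made for illustration: the directions are an orthonormal basis of the centered cluster centroids, the statistic is the pooled within-cluster standard deviation along each axis, and the names `farthest_point_init`, `kmeans_step`, `map_and_rescale`, and `mr_kmeans` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def farthest_point_init(X, k):
    """Deterministic seeding (an assumption): start at X[0], then greedily add the farthest point."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest already-chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)


def kmeans_step(X, centers):
    """One Lloyd iteration: assign points to the nearest center, then recompute the centers."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        if np.any(labels == j):  # an empty cluster keeps its previous center
            new_centers[j] = X[labels == j].mean(axis=0)
    return labels, new_centers


def map_and_rescale(X, labels, centers, eps=1e-8):
    """Hypothetical M-R step; both choices below are assumptions, not taken from the paper."""
    mean = X.mean(axis=0)
    # Mapping: orthonormal basis obtained from the centered centroids via reduced QR.
    Q, _ = np.linalg.qr((centers - mean).T)
    Z = (X - mean) @ Q
    C = (centers - mean) @ Q
    # Rescaling: divide each axis by the pooled within-cluster standard deviation.
    resid = Z - C[labels]
    scale = np.sqrt((resid ** 2).mean(axis=0)) + eps
    return Z / scale, C / scale


def mr_kmeans(X, k, n_iter=20):
    """Alternate the M-R step with Lloyd iterations until the labels stop changing."""
    X = np.asarray(X, dtype=float)
    labels, centers = kmeans_step(X, farthest_point_init(X, k))
    for _ in range(n_iter):
        X, centers = map_and_rescale(X, labels, centers)
        new_labels, centers = kmeans_step(X, centers)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Calling `mr_kmeans(X, k=2)` on an `(n_samples, n_features)` array returns one integer label per row. Dividing each axis by the within-cluster spread is one simple way to push the clusters toward the equal-variance, spherical shape that k-means implicitly assumes; other direction choices or statistics would fit the same loop.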