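Text data are high-dimensional and sparsely distributed, and the features of different classes overlap, all of which poses challenges for cluster analysis. To address these characteristics, this paper combines feature weighting with soft subspace learning within a fuzzy clustering framework and proposes a new soft subspace fuzzy clustering method suited to high-dimensional text data. First, a new feature-weighted distance measure is proposed based on weighted norm theory. Next, this measure is combined with the theoretical framework of soft subspace learning to obtain a new objective learning criterion for fuzzy clustering. Introducing an entropy exponent r into the constraints extends the admissible range of the fuzzy index m, and a physical interpretation is given. The global convergence of the algorithm is proved theoretically using Zangwill's convergence theorem. Experiments show that the proposed algorithm performs soft subspace learning and cluster analysis simultaneously, and that its performance improves considerably on existing related algorithms.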
Text data are characterized by high dimensionality and by features that overlap among different clusters, which poses a great challenge for real-world data mining applications. This paper proposes a novel fuzzy clustering algorithm that integrates a feature weighting metric into the framework of soft subspace learning. First, the feature weighting metric is presented based on the concept of the vector norm. Then, a novel learning criterion is proposed by combining the feature weighting metric with soft subspace clustering. An entropy exponent r is introduced into the constraints so that the range of the fuzzy index m is extended, and a physical explanation from the viewpoint of information theory is given. Global convergence is also established by applying Zangwill's convergence theorem. Finally, experiments are conducted on both synthetic and real text data, and the results show that the proposed algorithm can perform clustering analysis and soft subspace learning simultaneously and obtains better results than some existing approaches.
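To make the idea of a feature-weighted distance inside a fuzzy clustering step concrete, the following is a minimal sketch and not the paper's formulation. It assumes a squared Euclidean distance in which each cluster has its own feature weights raised to an exponent r, and it uses the classical fuzzy c-means membership update, which requires m > 1 (the abstract states that the proposed method extends this range; that extension is not reproduced here). The function names, the way r enters the distance, and the example values are all illustrative assumptions.

```python
def weighted_sq_distance(x, center, weights, r=1.0):
    """Squared distance with per-feature weights raised to r (assumed form, not the paper's)."""
    return sum((w ** r) * (a - b) ** 2 for a, b, w in zip(x, center, weights))


def fuzzy_memberships(x, centers, weights, m=2.0, r=1.0, eps=1e-12):
    """Classical fuzzy c-means membership update using the weighted distance above.

    weights[i] is the feature-weight vector of cluster i (nonnegative, summing to 1).
    """
    if m <= 1.0:
        raise ValueError("the classical update used here needs m > 1")
    # eps guards against division by zero when x coincides with a center.
    dists = [weighted_sq_distance(x, c, w, r) + eps for c, w in zip(centers, weights)]
    power = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical two-cluster example: cluster 0 emphasizes feature 0, cluster 1 feature 1.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [[0.9, 0.1], [0.1, 0.9]]
u = fuzzy_memberships([0.1, 0.5], centers, weights)
```

In this example the point is close to cluster 0 along the feature that cluster 0 weights most heavily, so its membership in cluster 0 dominates. A full soft subspace algorithm would alternate such membership updates with updates of the centers and of the feature weights.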