Subspace clustering has been widely applied in fields that involve clustering high-dimensional data and has attracted growing attention from machine learning researchers. It is a clustering technique that incorporates feature selection: by selecting a subset of salient features, it clusters on a low-dimensional representation of the high-dimensional data, which often yields better performance in practice. Compared with hard clustering, soft clustering can provide more meaningful partitions of complex data. In this paper, we extend k-means clustering and present a novel subspace clustering method, the reliability-based regularized weighted soft k-means clustering algorithm (RRWSKM). The method computes the contribution of each dimension to each cluster and thereby finds the subsets of salient dimensions relevant to different clusters. It can also identify data patterns accurately when its model parameters are tuned. These properties are obtained by adding dimension-weight entropy and partition entropy terms to the objective function as regularizers, which avoids overfitting and encourages more dimensions to contribute to identifying the clusters. To improve robustness, a data reliability measure is used to determine the initial dimension weights, which improves the reliability and performance of the algorithm. Because the RRWSKM optimization problem is non-convex, it is solved with iterative update formulas. Experiments on several real-world data sets show that, compared with other subspace clustering methods, the proposed method effectively finds low-dimensional representations of high-dimensional data, achieves better clustering performance, and is well suited to clustering high-dimensional data.
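The abstract does not state the RRWSKM objective or its update formulas, so the following is only a minimal sketch of a generic entropy-regularized weighted soft k-means, not the authors' algorithm. It assumes an objective of the form sum_i sum_k u_ik sum_j w_kj (x_ij - c_kj)^2 + gamma sum_k sum_j w_kj log w_kj + eta sum_i sum_k u_ik log u_ik, with memberships and per-cluster dimension weights each summing to one. The names `weighted_soft_kmeans`, `softmax_neg`, `gamma`, and `eta` are hypothetical, and the reliability-based initialization of the dimension weights is not reproduced: uniform initial weights are used instead.

```python
import math
import random


def softmax_neg(values, temperature):
    """Return weights proportional to exp(-v / temperature), computed stably."""
    lowest = min(values)
    exps = [math.exp(-(v - lowest) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_kmeans(X, k, gamma=1.0, eta=1.0, n_iter=50, seed=0):
    """Alternating updates for an assumed entropy-regularized objective.

    X is a list of equal-length lists of floats. gamma scales the
    dimension-weight entropy term and eta scales the partition entropy term.
    Returns (memberships, centers, dimension weights).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    centers = [list(x) for x in rng.sample(X, k)]
    # Assumption: uniform initial weights stand in for the paper's
    # reliability-based initialization, which the abstract does not specify.
    weights = [[1.0 / d] * d for _ in range(k)]
    U = [[1.0 / k] * k for _ in range(n)]

    for _ in range(n_iter):
        # Soft memberships: u_ik proportional to exp(-weighted distance / eta).
        for i, x in enumerate(X):
            dist = [
                sum(weights[c][j] * (x[j] - centers[c][j]) ** 2 for j in range(d))
                for c in range(k)
            ]
            U[i] = softmax_neg(dist, eta)

        # Centers: membership-weighted mean of the points.
        for c in range(k):
            mass = sum(U[i][c] for i in range(n))
            if mass == 0.0:
                continue  # keep the previous center for an empty cluster
            for j in range(d):
                centers[c][j] = sum(U[i][c] * X[i][j] for i in range(n)) / mass

        # Dimension weights: w_kj proportional to exp(-dispersion_kj / gamma),
        # so dimensions along which a cluster is compact get larger weights.
        for c in range(k):
            disp = [
                sum(U[i][c] * (X[i][j] - centers[c][j]) ** 2 for i in range(n))
                for j in range(d)
            ]
            weights[c] = softmax_neg(disp, gamma)

    return U, centers, weights
```

Calling `weighted_soft_kmeans(X, k=2)` on a list of feature vectors returns one membership row per point and one weight row per cluster. Both `gamma` and `eta` are compared against squared distances, so suitable values depend on the scale of the data; larger values give more uniform weights and softer memberships. Like k-means, the alternating updates only reach a local solution that depends on the initial centers.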