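Cluster analysis is an important technique in machine learning and data mining. Unlike supervised learning, it has no class or label information to guide it, so choosing an appropriate number of clusters (that is, model selection) has long been a difficult problem in cluster analysis. To address this, a clustering algorithm based on the Dirichlet process mixture model is proposed, and a collapsed Gibbs sampling algorithm is used to estimate the parameters of the mixture model. The new algorithm is built on the nonparametric Bayesian framework; over successive sampling iterations it optimizes the model parameters and arrives at a suitable number of clusters. Clustering experiments on synthetic and real data sets show that the clustering algorithm based on the Dirichlet process mixture model not only determines the number of clusters automatically but is also flexible and robust.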
Clustering is one of the most useful techniques in machine learning and data mining. In cluster analysis, model selection, that is, determining the number of clusters, is an important issue. Unlike supervised learning, there are no class labels or criteria to guide the search, so the clustering model is always difficult to select. To tackle this problem, we present a nonparametric clustering approach based on the Dirichlet process mixture model (DPMM) and apply a collapsed Gibbs sampling technique to sample from the posterior distribution. The proposed clustering algorithm follows the Bayesian nonparametric framework and can optimize both the number of components and the parameters of the model. The clustering experiments show that this Bayesian model has promising properties and robust performance.
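To make the sampling idea concrete, the following is a minimal sketch of collapsed Gibbs sampling for a Dirichlet process mixture, not the implementation evaluated in the paper. It assumes one-dimensional data, a Gaussian likelihood with known variance `sigma2`, and a conjugate Normal prior on each cluster mean with hyperparameters `mu0` and `tau2`; these choices, the concentration parameter `alpha`, and the function names `log_predictive` and `collapsed_gibbs` are all illustrative assumptions that the abstract does not specify. Cluster means are integrated out, so each sweep only resamples the cluster assignment of every point, and a new cluster can be opened at any step.

```python
import math
import random


def log_predictive(x, n, s, mu0=0.0, tau2=10.0, sigma2=1.0):
    """Log posterior-predictive density of x for a cluster with n points summing to s.

    Assumed model (illustrative): x ~ N(mu, sigma2), mu ~ N(mu0, tau2).
    With n = 0 this reduces to the prior predictive N(mu0, tau2 + sigma2).
    """
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + s / sigma2)
    var = post_var + sigma2
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - post_mean) ** 2 / var


def collapsed_gibbs(data, alpha=1.0, n_iter=50, seed=0):
    """Return one cluster label per data point after n_iter Gibbs sweeps."""
    rng = random.Random(seed)
    z = [0] * len(data)              # start with every point in one cluster
    counts = {0: len(data)}          # cluster label -> number of points
    sums = {0: float(sum(data))}     # cluster label -> sum of its points
    next_label = 1
    for _ in range(n_iter):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            k = z[i]
            counts[k] -= 1
            sums[k] -= x
            if counts[k] == 0:
                del counts[k]
                del sums[k]
            # Existing cluster: weight n_k * predictive; new cluster: alpha * prior predictive.
            labels = list(counts)
            logw = [math.log(counts[c]) + log_predictive(x, counts[c], sums[c])
                    for c in labels]
            logw.append(math.log(alpha) + log_predictive(x, 0, 0.0))
            top = max(logw)
            weights = [math.exp(v - top) for v in logw]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(labels):
                k_new = next_label   # open a new cluster
                next_label += 1
                counts[k_new] = 0
                sums[k_new] = 0.0
            else:
                k_new = labels[choice]
            z[i] = k_new
            counts[k_new] += 1
            sums[k_new] += x
    return z
```

Calling `collapsed_gibbs(data)` on a list of floats returns a list of integer labels; the number of distinct labels is the number of clusters in that posterior sample, which is how the cluster count emerges from sampling rather than being fixed in advance. In practice one would inspect several samples after a burn-in period rather than a single final state.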