Existing text clustering algorithms often fail to produce satisfactory results. To address this problem, a text clustering optimization model (TCOM) based on the expectation-maximization (EM) algorithm is proposed. The model describes the similarity distributions of similar and dissimilar cluster pairs, as well as the importance distributions of important and unimportant documents. Based on TCOM, a method is designed that merges different text clustering results to obtain the best performance. Experimental results show that the method improves both clustering precision and recall, and that it outperforms existing hard and soft clustering algorithms used on their own.