News topic clustering has important applications in public opinion monitoring, hot topic discovery, and real-time tracking of breaking events. K-means-based text clustering is widely used for news topic clustering because it is simple to implement, has low time and space complexity, and produces good clustering results. The traditional K-means algorithm nevertheless has limitations: it is sensitive to the choice of initial center points, and the user must specify the number of clusters K in advance. As a result, the algorithm can converge to a local optimum instead of reaching the global optimum. To address the unstable clustering results caused by random selection of the initial cluster centers, this paper proposes an improved K-means algorithm for news topic detection. The algorithm selects the initial cluster centers based on the similarity between news reports, which ensures that the resulting news topic clusters are well separated from one another. On this basis, the number of topic clusters K is determined automatically from the news topic coverage rate. Experimental results show that the improved algorithm generates stable, high-quality topic clusters.
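The two ideas summarized above, similarity-based selection of the initial centers and a coverage-based choice of K, can be illustrated with a minimal Python sketch. The abstract does not give the exact procedure, so everything here is an assumption made for illustration: the bag-of-words features, the greedy seeding rule, the function names `vectorize`, `cosine`, `select_seeds`, and `kmeans`, and the parameters `sim_threshold` and `coverage` are all hypothetical and should not be read as the authors' method.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for the paper's features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def select_seeds(vectors, sim_threshold=0.3, coverage=0.9):
    """Greedily pick seeds until a `coverage` fraction of reports lies near a seed.

    Both parameters are hypothetical tuning knobs, not values from the paper.
    Returns the seed indices; their count plays the role of K.
    """
    n = len(vectors)
    neighbours = [
        {j for j in range(n) if cosine(vectors[i], vectors[j]) >= sim_threshold}
        for i in range(n)
    ]
    covered, seeds = set(), []
    while len(covered) < coverage * n:
        # Next seed: the uncovered report with the most uncovered neighbours.
        best = max(
            (i for i in range(n) if i not in covered),
            key=lambda i: len(neighbours[i] - covered),
        )
        seeds.append(best)
        covered |= neighbours[best] | {best}
    return seeds


def kmeans(vectors, seeds, iterations=10):
    """Ordinary K-means loop with cosine-based assignment, started from fixed seeds."""
    centroids = [Counter(vectors[s]) for s in seeds]
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                total = Counter()
                for v in members:
                    total.update(v)
                # The sum points in the same direction as the mean, which is
                # all that cosine similarity depends on.
                centroids[k] = total
    return labels


reports = [
    "earthquake hits coastal city rescue teams deployed",
    "rescue teams search city after earthquake",
    "earthquake rescue continues in coastal city",
    "central bank raises interest rates again",
    "interest rates raised by central bank",
    "bank interest rates rise as central bank acts",
]
vectors = [vectorize(r) for r in reports]
seeds = select_seeds(vectors)
labels = kmeans(vectors, seeds)
print(len(seeds), labels)  # 2 [0, 0, 0, 1, 1, 1]
```

Because the seeds are chosen deterministically from the similarity structure rather than at random, repeated runs on the same reports give the same clusters, which is the stability property the abstract emphasizes. In this sketch the number of clusters is never passed in: it falls out of how many seeds are needed to reach the coverage target.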