The short length, informal language, and heavy noise of microblog posts prevent traditional topic detection methods from reliably discovering new topics, and these methods also ignore the time at which posts are published. To address these characteristics and the dynamic nature of topics, this paper proposes a microblog topic detection method based on clustering ensemble. The method takes into account a nonlinear time factor for post publication, uses an improved K-Means algorithm to build one base clusterer for each microblog feature, and evaluates the effectiveness and diversity of the base clusterers to set the ensemble voting weights before performing the final clustering ensemble. Comparative experiments show that the method improves the accuracy of microblog topic detection by about 9.5% and detects new topics more effectively.
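The weighted voting step can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `weighted_consensus`, the co-association voting rule, the `threshold` parameter, and the assumption that the weights are already given are all illustrative choices, since the abstract does not specify how the weights are computed or how the votes are combined.

```python
from itertools import combinations


def weighted_consensus(labelings, weights, threshold=0.5):
    """Merge base clusterings by weighted co-association voting.

    labelings: one label list per base clusterer, all of the same length.
    weights:   one non-negative voting weight per base clusterer (assumed given).
    Two posts are linked when the weighted share of base clusterers that put
    them in the same cluster exceeds `threshold`; linked posts form one topic.
    """
    n = len(labelings[0])
    total = sum(weights)
    parent = list(range(n))  # union-find forest over posts

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        # Weighted share of base clusterers that group posts i and j together.
        vote = sum(w for labels, w in zip(labelings, weights)
                   if labels[i] == labels[j]) / total
        if vote > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive topic ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three hypothetical base clusterings of five posts, with made-up weights.
base = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
print(weighted_consensus(base, [5, 3, 2]))
```

Because linked posts are merged transitively, a chain of pairwise links can join posts that were never voted together directly; a stricter consensus function would be needed where that matters.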