Microblog posts are short, spread quickly, and shift topics frequently, which poses new challenges for social topic detection and tracking. Exploiting the temporal behavior of microblog topics and the semantic similarity of short texts, a systematic framework for microblog topic detection and tracking based on microblog clustering is proposed. First, temporal frequent word sets of microblog text are defined, and a feature word selection method oriented toward hot topics is derived from them. Second, initial microblog clusters are obtained from the temporal frequent feature word sets by means of maximal frequent itemsets. Because the same text may fall into several initial clusters, an overlap reduction algorithm based on the extended semantic membership of short texts is proposed, yielding fully separated initial clusters. Finally, an agglomerative topic clustering method is given that operates on the cluster semantic similarity matrix. Experiments on real-world data from Sina Weibo indicate that the proposed method is applicable to hot topic detection and tracking in Chinese microblogs.
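The four stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract gives no formulas. The function names, the thresholds, the toy data, and the use of Jaccard overlap as a stand-in for the extended semantic membership and the cluster semantic similarity are all assumptions made for illustration. Topic tracking over successive time windows is not covered.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed stand-in for semantic similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def select_feature_words(prev_docs, curr_docs, min_support=0.3, min_growth=2.0):
    """Keep words that are frequent in the current window and rising over time."""
    prev = Counter(w for d in prev_docs for w in d)
    curr = Counter(w for d in curr_docs for w in d)
    selected = set()
    for word, count in curr.items():
        support = count / len(curr_docs)
        prev_support = prev[word] / len(prev_docs) if prev_docs else 0.0
        if support >= min_support and support >= min_growth * prev_support:
            selected.add(word)
    return selected


def maximal_frequent_itemsets(docs, words, min_count=2, max_size=3):
    """Brute-force enumeration up to max_size; suitable for toy vocabularies only."""
    frequent = []
    for size in range(1, max_size + 1):
        for items in combinations(sorted(words), size):
            itemset = frozenset(items)
            if sum(itemset <= d for d in docs) >= min_count:
                frequent.append(itemset)
    # An itemset is maximal if no other frequent itemset strictly contains it.
    return [s for s in frequent if not any(s < t for t in frequent)]


def initial_clusters(docs, itemsets):
    """Map each itemset to the indices of the documents that contain it."""
    return {s: {i for i, d in enumerate(docs) if s <= d} for s in itemsets}


def remove_overlap(docs, clusters):
    """Assign every shared document to the one cluster it matches best."""
    disjoint = {s: set() for s in clusters}
    for i, d in enumerate(docs):
        candidates = [s for s, members in clusters.items() if i in members]
        if candidates:
            # Ties are broken alphabetically so the result is deterministic.
            best = max(candidates, key=lambda s: (jaccard(s, d), sorted(s)))
            disjoint[best].add(i)
    return {s: m for s, m in disjoint.items() if m}


def merge_clusters(docs, clusters, threshold=0.3):
    """Greedy agglomerative merging on the similarity of cluster vocabularies."""
    groups = [set(m) for m in clusters.values()]

    def vocab(group):
        return set().union(*(docs[i] for i in group))

    def sim(pair):
        return jaccard(vocab(groups[pair[0]]), vocab(groups[pair[1]]))

    while len(groups) > 1:
        a, b = max(combinations(range(len(groups)), 2), key=sim)
        if sim((a, b)) < threshold:
            break
        merged = groups.pop(b)  # b > a, so index a stays valid
        groups[a] |= merged
    return groups


# Toy token sets standing in for segmented microblog posts (hypothetical data).
prev_docs = [{"weather", "rain"}, {"lunch", "noodles"},
             {"weather", "sunny"}, {"music", "festival"}]
curr_docs = [{"earthquake", "sichuan", "rescue"},
             {"earthquake", "sichuan", "donation"},
             {"earthquake", "rescue", "team"},
             {"concert", "tickets", "beijing"},
             {"concert", "tickets", "sold"},
             {"weather", "rain"}]

features = select_feature_words(prev_docs, curr_docs)
itemsets = maximal_frequent_itemsets(curr_docs, features)
overlapping = initial_clusters(curr_docs, itemsets)
separated = remove_overlap(curr_docs, overlapping)
topics = merge_clusters(curr_docs, separated)
print(topics)
```

In this toy run, post 0 initially belongs to two clusters, and the overlap step assigns it to exactly one of them before merging. A realistic system would replace the brute-force itemset enumeration with a dedicated frequent-pattern miner and the Jaccard measure with a semantic similarity suited to short texts.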