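Social networks, with microblogs as a leading example, have become an important means for users to publish and obtain real-time information; however, a large share of that real-time information is junk or redundant. Accurately discovering, organizing, and exploiting the valuable information hidden behind the massive volume of short texts on social networks, and mining the latent topics in microblogs, is therefore of considerable value for public opinion monitoring and commercial promotion. Although the probabilistic generative topic model LDA (Latent Dirichlet Allocation) has been widely applied to topic mining, the traditional LDA model cannot model microblog data well, because short microblog messages are semantically sparse and are related to one another. To address this, UR-LDA, a model based on LDA that jointly considers the text relations and contact relations of microblogs and is suited to processing microblog user-relation data, is proposed, and the model is derived using Gibbs sampling. Experimental results on a real dataset show that the UR-LDA model can effectively mine topics from microblogs.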
Social networks, microblogs in particular, have become a significant way for users to propagate and retrieve information. However, a large proportion of this real-time information is junk or redundant. Discovering latent topics in social networks by effectively finding, organizing, and using the valuable information behind the mass of short texts therefore carries high value for public opinion monitoring and commercial promotion. Although the probabilistic generative topic model Latent Dirichlet Allocation (LDA) has been widely applied in topic mining, it does not work well on microblogs, whose messages contain little information and are connected with one another. A novel probabilistic generative model based on LDA, called UR-LDA, is proposed; it is suited to modeling microblog data and takes document relations and user relations into consideration to aid topic mining in microblogs. A Gibbs sampling implementation for inference in the UR-LDA model is also presented. Experimental results on a real dataset show that UR-LDA offers an effective solution to topic mining for microblogs.