To address the insufficient dimensionality reduction and the loss of semantic information that traditional topic detection methods suffer from when processing large-scale microblog short texts, this paper proposes a microblog topic detection method that combines latent semantic analysis (LSA) with the structural properties of microblogs. Based on the conversational nature and propagation model of microblogs, the method first merges microblog discussion trees to extend the microblog text. It then builds an LSA-based microblog text model to address data sparsity. Finally, it defines a new similarity measure that incorporates time information and detects microblog topics using agglomerative hierarchical clustering. Experimental results show that the proposed method reduces the miss rate of topic detection and significantly improves the performance of microblog topic detection.
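The abstract names four steps but gives no formulas, so the following is only a minimal sketch of how such a pipeline could fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the post fields (`id`, `root`, `time`, `text`), concatenating replies into their root post as the way to "merge" a discussion tree, an exponential time decay `exp(-|Δt| / tau)` multiplied into cosine similarity, average linkage, and the parameter values `k`, `tau`, and `threshold`. LSA is computed here with a small pure-Python power iteration on the document Gram matrix to keep the example dependency-free; a real system would use a sparse SVD routine.

```python
import math
from collections import Counter


def merge_discussion_trees(posts):
    """Fold each reply/repost into its root post: one extended document per tree."""
    docs = {}
    for post in posts:
        root_id = post["id"] if post["root"] is None else post["root"]
        doc = docs.setdefault(root_id, {"tokens": [], "time": post["time"]})
        doc["tokens"].extend(post["text"].lower().split())
        doc["time"] = min(doc["time"], post["time"])  # assumed: tree time = earliest post
    return docs


def lsa_coordinates(token_lists, k, iters=500):
    """Rank-k LSA document coordinates (rows of V_k * S_k) from the Gram matrix A^T A."""
    counts = [Counter(tokens) for tokens in token_lists]
    n = len(counts)
    gram = [[float(sum(ci[t] * cj[t] for t in ci)) for cj in counts] for ci in counts]
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [1.0 + i for i in range(n)]  # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):  # power iteration for the dominant eigenvector
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * gram[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))  # singular value = sqrt(eigenvalue)
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next-largest eigenpair.
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


def time_aware_similarity(x, y, tx, ty, tau):
    """Cosine similarity in LSA space, damped by an assumed exponential time decay."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return dot / (nx * ny) * math.exp(-abs(tx - ty) / tau)


def agglomerative_clusters(sim, threshold):
    """Average-linkage agglomerative clustering; stop when no pair reaches threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if score > best:
                    best, pair = score, (a, b)
        if best < threshold:
            break
        a, b = pair
        merged = clusters.pop(b)
        clusters[a].extend(merged)
    return clusters


def detect_topics(posts, k=2, tau=24.0, threshold=0.5):
    """Run the four steps and return topics as lists of root-post ids."""
    docs = merge_discussion_trees(posts)
    ids = list(docs)
    coords = lsa_coordinates([docs[i]["tokens"] for i in ids], k)
    times = [docs[i]["time"] for i in ids]
    n = len(ids)
    sim = [[time_aware_similarity(coords[a], coords[b], times[a], times[b], tau)
            for b in range(n)] for a in range(n)]
    return [[ids[i] for i in c] for c in agglomerative_clusters(sim, threshold)]


# Hypothetical toy data: "root" is the id of the tree's root post, "time" is in hours.
SAMPLE_POSTS = [
    {"id": 1, "root": None, "time": 0.0, "text": "earthquake hits city"},
    {"id": 2, "root": 1, "time": 0.5, "text": "earthquake rescue teams city"},
    {"id": 3, "root": None, "time": 1.0, "text": "rescue teams earthquake zone"},
    {"id": 4, "root": None, "time": 2.0, "text": "football final goal"},
    {"id": 5, "root": 4, "time": 2.5, "text": "goal football fans"},
    {"id": 6, "root": None, "time": 3.0, "text": "fans celebrate football final"},
]
```

Calling `detect_topics(SAMPLE_POSTS)` on this toy data groups the two earthquake trees into one topic and the two football trees into another. Merging replies before building the term vectors is what gives the short root posts enough shared vocabulary for the low-rank projection to separate the topics; the time decay only matters when similar texts are far apart in time.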