为解决传统词共现方法在微博中检测话题时计算复杂度大、查全率不高、查准率低的情况,提出一种基于粗糙集原理的改进词共现算法(RSCW).通过词共现关系形成词共现矩阵,并由共现矩阵找出极大完全子图作为话题簇中心,最后由粗糙集原理找出每个话题的关键词集合.在NLPIR微博内容语料库和实时获取的微博数据集上的实验结果表明,该方法能够有效地从大规模微博信息中检测突发新闻,提高突发新闻的识别率.
Traditional word co-occurrence detection methods in microblog news encounter the problems of high computational complexity, high time consuming, low recall rate and low precision. An improved algorithm of word co-occurrence detection based on rough set is proposed in this paper aiming at solving these problems. It builds a word co-occurrence matrix through word co-occurrence relation, and finds out the maximum complete subgraph as topic cluster center via co-occurrence matrix, finally identifies each topic keyword set using the rough set theory. The experimental results carried out on the microblog content corpus of NLPIR and the real-time collection of microblog data set verify that this method can effectively detect news topic from the massive microblog information and realize the news topic tracking.