Microblog posts are short, contain few words, and use nonstandard vocabulary, so traditional hot topic detection methods perform poorly on them. To address this problem, this paper proposes a microblog hot topic detection method based on growth speed. First, the preprocessed microblogs are divided into windows containing equal numbers of posts, and the frequency of each word in each window is counted and represented as a sequence of time-indexed two-tuples. Second, the growth slope of each word between every two adjacent windows is calculated to find the fast-growing words. Third, for each such word, the growth speed of the number of related users and of the number of related microblogs is calculated to determine whether the word is a hot topic word. Finally, hot topics are generated by clustering the hot topic words. Experiments verify the feasibility of the method. The results show that it improves detection efficiency to a certain extent, reduces the miss rate and the false detection rate, and can discover hot microblog topics effectively and in a timely manner.
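The first three steps can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the (user, tokens) input format, the use of a simple difference between adjacent windows as the growth slope, and the three thresholds are all assumptions made for illustration, and the final clustering step is omitted.

```python
from collections import Counter

def split_windows(posts, size):
    """Split time-ordered posts into consecutive windows of `size` posts.
    A trailing partial window is dropped (an assumption of this sketch)."""
    return [posts[i:i + size] for i in range(0, len(posts) - size + 1, size)]

def window_counts(window):
    """Return (term frequency, post count, user count) per word for one window."""
    tf, post_count, users = Counter(), Counter(), {}
    for user, tokens in window:
        tf.update(tokens)
        for word in set(tokens):
            post_count[word] += 1
            users.setdefault(word, set()).add(user)
    user_count = Counter({w: len(u) for w, u in users.items()})
    return tf, post_count, user_count

def hot_words(posts, size, tf_min, post_min, user_min):
    """Flag words whose frequency, post count, and user count all grow by at
    least the given (hypothetical) thresholds between two adjacent windows."""
    stats = [window_counts(w) for w in split_windows(posts, size)]
    hot = set()
    for prev, cur in zip(stats, stats[1:]):
        for word in cur[0]:
            # Adjacent windows are one step apart, so the slope is a difference.
            tf_slope, post_slope, user_slope = (c[word] - p[word]
                                                for p, c in zip(prev, cur))
            if (tf_slope >= tf_min and post_slope >= post_min
                    and user_slope >= user_min):
                hot.add(word)
    return hot

# Toy data: each post is (user id, list of already segmented words).
posts = [
    ("u1", ["rain", "today"]), ("u2", ["lunch"]), ("u3", ["rain"]),
    ("u1", ["quake", "quake"]), ("u2", ["quake", "news"]), ("u3", ["quake"]),
]
print(hot_words(posts, size=3, tf_min=2, post_min=2, user_min=2))
```

Requiring growth in related users and related posts, and not only in raw word frequency, keeps a single user repeating one word from being flagged, which is the role the abstract assigns to the third step.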