With the rapid development of the Internet and the continuous growth of massive data, identifying current hot news quickly and effectively has become an urgent need, and online news topic detection has become a hot research area. For representing news text in an online setting, the dimensionality of the traditional Vector Space Model (VSM) keeps growing as data accumulates, which aggravates the problems of data sparsity and synonymy and makes text similarity difficult to measure accurately. Latent semantic analysis based on feature weighting is used to map the high-dimensional, sparse word-document matrix into a hidden k-dimensional semantic space. This makes full use of the semantic information among words and documents to raise the semantic similarity between documents on the same topic, overcoming the problems of text sparsity and synonymy in the online setting. In addition, for continuously growing large-scale news data, traditional clustering algorithms suffer from high time complexity or input dependency, so they struggle to produce the desired results quickly and effectively. Based on the temporal order and correlation of news reports, an improved Single-pass online incremental clustering algorithm is proposed to detect topic clusters, and the concept of a topic heat value is introduced to select the hot topics that currently attract the most attention. Experimental results show that the proposed method effectively improves the accuracy of topic detection and achieves online topic capture on a real news dataset.
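To make the clustering step concrete, the following is a minimal sketch of a plain Single-pass incremental clustering loop. It is not the improved variant the abstract proposes, whose details are not given here. It assumes each news document has already been projected into the k-dimensional semantic space and arrives as a dense vector in time order. The function names `cosine` and `single_pass`, the centroid-averaging update, and the threshold value 0.8 are illustrative assumptions rather than choices taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold=0.8):
    """Cluster vectors in arrival order; returns one cluster label per vector.

    Each cluster is summarised by the running mean (centroid) of its members.
    `threshold` is a hypothetical similarity cut-off, not a value from the paper.
    """
    centroids = []  # one centroid per cluster
    sizes = []      # number of members per cluster
    labels = []
    for vec in vectors:
        # Find the most similar existing cluster, if any.
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best >= 0 and best_sim >= threshold:
            # Join the cluster and update its centroid incrementally.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            # No cluster is close enough: this document starts a new topic.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy stream of five "documents" in a 3-dimensional semantic space (made-up values).
stream = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [1.0, 0.0, 0.1],
]
labels = single_pass(stream, threshold=0.8)
print(labels)
```

Because every document is compared only against the current cluster centroids and is never revisited, the data is read once, which is what makes this family of algorithms suitable for a continuously growing news stream. The result does depend on arrival order and on the chosen threshold.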