Retrospective topic detection methods suffer from poor timeliness. To address this problem, an improved locality-sensitive hashing (LSH) algorithm is proposed and applied to hierarchical topic detection for Internet news. In addition to mining the content features of news articles, the latent Dirichlet allocation (LDA) topic model is used to mine their semantic (topic) features. The non-binary content feature vectors and topic feature vectors are then converted into a binary feature space, and the LSH algorithm is applied in turn to cluster the news texts by content features and by topic features, yielding topics with a "topic-content" hierarchy. Experimental results show that mining both content features and topic features represents news content more accurately and completely, and that converting both kinds of features into a unified binary space effectively reduces the time complexity of clustering. The method thus improves the efficiency of topic detection while maintaining detection accuracy and the extensibility of topics at the semantic level.
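The pipeline summarized above (binarize the feature vectors, then hash them into buckets at two levels) can be illustrated with a minimal sketch. It is not the paper's improved LSH algorithm, whose details the abstract does not give. It assumes simple thresholding for binarization and plain bit-sampling LSH over Hamming space with a single hash table per level. The function names (`binarize`, `make_bit_sampler`, `hierarchical_buckets`), the thresholds, the toy vectors, and the choice to nest content buckets under topic buckets are all hypothetical.

```python
import random
from collections import defaultdict


def binarize(vector, threshold):
    """Map a real-valued feature vector to bits (assumed rule: weight > threshold)."""
    return tuple(1 if x > threshold else 0 for x in vector)


def make_bit_sampler(dim, n_bits, rng):
    """Bit-sampling LSH: the hash key is the bits at a few random positions."""
    positions = sorted(rng.sample(range(dim), n_bits))

    def key(bits):
        return tuple(bits[p] for p in positions)

    return key


def hierarchical_buckets(content_vecs, topic_vecs, content_bits=3, topic_bits=2, seed=0):
    """Group article indices by topic key, then by content key (assumed nesting)."""
    rng = random.Random(seed)
    topic_key = make_bit_sampler(len(topic_vecs[0]), topic_bits, rng)
    content_key = make_bit_sampler(len(content_vecs[0]), content_bits, rng)
    tree = defaultdict(lambda: defaultdict(list))
    # One pass over the articles: no pairwise similarity comparisons are needed.
    for i, (c, t) in enumerate(zip(content_vecs, topic_vecs)):
        tree[topic_key(t)][content_key(c)].append(i)
    return {tk: dict(sub) for tk, sub in tree.items()}


# Toy data (hypothetical): term weights and LDA-style topic proportions for 4 articles.
content_raw = [
    [0.9, 0.0, 0.4, 0.0],
    [0.7, 0.0, 0.5, 0.0],
    [0.0, 0.8, 0.0, 0.6],
    [0.0, 0.9, 0.0, 0.3],
]
topic_raw = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]

content_bin = [binarize(v, 0.0) for v in content_raw]
topic_bin = [binarize(v, 0.5) for v in topic_raw]
tree = hierarchical_buckets(content_bin, topic_bin)

for t_key, sub in tree.items():
    for c_key, article_ids in sub.items():
        print("topic", t_key, "content", c_key, "articles", article_ids)
```

In this toy input, articles 0 and 1 have identical binary vectors and so always share a leaf bucket, as do articles 2 and 3. Each article is hashed once per level, which is the source of the reduced clustering cost compared with pairwise comparison. A practical LSH setup would typically use several hash tables to reduce the chance of separating near-duplicates.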