Removing news items with identical or near-identical content is one of the key techniques for improving search engines. This paper proposes a news deduplication algorithm based on keyword extraction. By building lexical chains with the title as the seed, the algorithm finds words that are not high-frequency but contribute strongly to the topic, and so extracts a complete keyword set for each document; it can also recognize new (unknown) words using only a small-scale corpus. To improve the speed and quality of web page deduplication, an inverted index for deduplication is built over the extracted keywords. Experimental results show that, compared with traditional methods, the exclusion error rate is reduced by 5% and deduplication time is shortened by 20%-30%.
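The idea of deduplicating through an inverted index over keywords can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes keywords have already been extracted for each document, and it uses Jaccard overlap with a cut-off of 0.8 as a stand-in similarity test. The class name `KeywordDedupIndex`, its methods, and the threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class KeywordDedupIndex:
    """Inverted index from keyword to the ids of documents already kept."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold        # assumed Jaccard cut-off, not from the paper
        self.postings = defaultdict(set)  # keyword -> ids of documents containing it
        self.keywords = {}                # document id -> frozenset of its keywords

    def find_duplicate(self, kws):
        """Return the id of a stored near-duplicate, or None if there is none."""
        kws = frozenset(kws)
        # Only documents sharing at least one keyword are ever examined,
        # which is what makes the inverted index faster than all-pairs comparison.
        shared = defaultdict(int)
        for kw in kws:
            for doc_id in self.postings.get(kw, ()):
                shared[doc_id] += 1
        for doc_id, overlap in shared.items():
            union = len(kws) + len(self.keywords[doc_id]) - overlap
            if overlap / union >= self.threshold:
                return doc_id
        return None

    def add(self, doc_id, kws):
        """Index the document unless it duplicates a stored one.

        Returns the id of the duplicate, or None if the document was indexed.
        """
        dup = self.find_duplicate(kws)
        if dup is None:
            kws = frozenset(kws)
            self.keywords[doc_id] = kws
            for kw in kws:
                self.postings[kw].add(doc_id)
        return dup
```

In this sketch a call such as `index.add("b", keywords)` returns the id of an earlier document when the two keyword sets overlap enough, and otherwise stores the new document. Candidate lookup touches only the posting lists of the incoming document's keywords, so the cost depends on keyword-set size rather than on the total number of stored documents.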