文章介绍了体育新闻搜索引擎系统Geeking的框架结构和各项功能,其结构分为网页爬取、胜者表构建、检索处理、用户界面4个部分,其主要功能包含查询词校正、自动补全、检索结果排序、相似新闻聚类以及显示页面中关键词高亮并提供网页快照。输入查询请求时,系统根据搜索日志和新闻热词自动补全查询词,搜索不到相关结果时校正查询,给出推荐的查询词。检索新闻文档时,使用胜者表快速查找查询词项的相关文档,综合tf-idf权重和新闻标题、发布时间等因素计算文档的相关性并按得分排序。在相似新闻聚类中,结合最长公共子序列和编辑距离衡量新闻标题之间的相似度,以新闻标题相似度代表新闻文档的相似度。测试结果表明,基于胜者表的Geeking搜索引擎系统各项功能协调效果好,检索响应速度快。
In this paper, a sports news search engine, Geeking, was introduced, which contains four functional models: web crawling, champion list building, search processing and user interface. Geeking could provide query correction, query auto-completion, search results sorting, news clustering, keywords highlighting and snapshot visualization. Given a query, the system automatically completes the query according to the search logs and the news hot keywords. If there was no return of result, the system could correct the query and provided the recommended query terms. The related documents were searched quickly according to the champion list. Based on the tf-idf values and other factors like news headlines and release time, the documents' relevance was calculated. For the clustering of similar news, the longest common subsequence and levenshtein distance were used to measure the similarity between news headlines and the similarity of news headlines could be regarded as the similarity between documents. Test results were given to show that Geeking is fast and stable.