Microblog Query Expansion Based on Term Time Distribution
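  • ISSN: 0254-4164
  • Journal: Chinese Journal of Computers (《计算机学报》)
  • Time: 0
  • Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliations: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001; [2] College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin 150050
  • Funding: Supported by the National Natural Science Foundation of China (61370170, 61402134, 61173074) and the National Social Science Foundation of China (14CTQ032).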
Chinese abstract (translated):

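This paper proposes a query expansion method for microblog retrieval based on the time distribution of terms. The method measures the relevance between an expansion term and the query terms by the similarity of their time distributions, and builds a query model based on term time distribution. Specifically, after defining term time distribution and giving a method for estimating it, the paper presents a measure of the time distribution similarity between query terms and expansion terms, uses that similarity as their relevance, and on this basis selects expansion terms and re-estimates the query model. Because the method expands queries with temporal information rather than content, it avoids the weakness of content-based query expansion methods, which cannot estimate expansion terms accurately because microblog texts are short. Experimental results on the TREC 2011 and TREC 2012 microblog retrieval evaluation data show that the query expansion model based on term time distribution effectively improves microblog retrieval performance: it is significantly better than the classical content-based query expansion model and also better than other methods that use time for query expansion.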

English abstract:

In microblog retrieval, content-based query expansion methods are inadequate because relevant microblog messages are too short to provide reliable term distribution information. Most existing time-based query expansion methods use the time profile to shift the prior probability of relevant microblogs. In essence, these methods still cannot escape the restrictions of short texts, because the relevance between expansion terms and the query is still based on microblog content. To address this problem, this paper proposes a query expansion method based on the time distribution of terms, in which the relevance between query terms and expansion terms is measured by the similarity of their time distributions. First, the changes in term frequency across time segments are analyzed, the term time distribution is defined, and methods for estimating it are described. Then an approach for estimating the similarity of term time distributions is presented and used as the relevance between query terms and expansion terms, which determines the expansion terms in the re-estimated query model. Two query expansion strategies are given for estimating the query expansion model according to the relevance between the expansion terms and the query. Finally, the term time distribution query model is obtained by integrating the query expansion model with the original query model. Using only the time profile to establish the relevance between query terms and expansion terms avoids the drawbacks that classical content-based query expansion approaches suffer because of the length limit of microblogs. Experiments were carried out on the TREC 2011 and TREC 2012 microblog retrieval collections. Several state-of-the-art baselines were chosen for comparison, including the classical language model, a content-based query expansion method, and a time-based query expansion method. The experimental results show that the term time distribution query model outperforms both the content-based and the time-based approaches.
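
The pipeline outlined in the abstract (estimate a time distribution per term, score candidate expansion terms by how closely their distribution matches the query's, then interpolate the expansion model with the original query model) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the estimator, the similarity measure, or either of the two expansion strategies. The equal-width time bins, the additive smoothing, the Jensen-Shannon-based similarity, the top-`k` selection, the interpolation weight `lam`, and all function names below are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def term_time_distributions(docs, num_bins, t_min, t_max, smoothing=0.01):
    """Estimate p(time bin | term) from (timestamp, tokens) pairs.

    Assumption: equal-width bins over [t_min, t_max] with additive smoothing.
    """
    width = (t_max - t_min) / num_bins
    counts = defaultdict(lambda: [0.0] * num_bins)
    for timestamp, tokens in docs:
        # Clamp so that a timestamp equal to t_max falls in the last bin.
        b = min(int((timestamp - t_min) / width), num_bins - 1)
        for tok in tokens:
            counts[tok][b] += 1.0
    dists = {}
    for term, c in counts.items():
        total = sum(c) + smoothing * num_bins
        dists[term] = [(x + smoothing) / total for x in c]
    return dists


def js_similarity(p, q):
    """1 minus the base-2 Jensen-Shannon divergence (assumed similarity measure)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - 0.5 * kl(p, m) - 0.5 * kl(q, m)


def expand_query(query_terms, dists, k=5, lam=0.5):
    """Return a term -> weight query model mixing original and expansion terms."""
    # Original query model: maximum-likelihood estimate over the query terms.
    tf = Counter(query_terms)
    original = {t: n / len(query_terms) for t, n in tf.items()}

    q_dists = [dists[t] for t in tf if t in dists]
    if not q_dists:
        return original
    # Assumption: the query's time distribution is the mean of its terms' distributions.
    n_bins = len(q_dists[0])
    q_time = [sum(d[i] for d in q_dists) / len(q_dists) for i in range(n_bins)]

    # Score every non-query term by time-distribution similarity to the query.
    scores = {t: js_similarity(d, q_time) for t, d in dists.items() if t not in tf}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    z = sum(s for _, s in top)
    if not top or z == 0:
        return original
    expansion = {t: s / z for t, s in top}

    # Linear interpolation of the original and expansion models.
    model = Counter()
    for t, p in original.items():
        model[t] += lam * p
    for t, p in expansion.items():
        model[t] += (1 - lam) * p
    return dict(model)
```

In this sketch, a term whose usage bursts in the same time bins as the query terms receives a higher expansion weight than a term that peaks elsewhere, regardless of whether the two ever co-occur in the same short message. A library routine such as SciPy's `scipy.spatial.distance.jensenshannon` could replace the hand-written divergence.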

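Journal information
  • Chinese Journal of Computers (《计算机学报》)
  • Peking University Core Journal (2011 edition)
  • Supervising body: Chinese Academy of Sciences
  • Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
  • Editor-in-chief: Sun Ninghui (孙凝晖)
  • Address: 6 Kexueyuan South Road, Zhongguancun, Beijing
  • Postal code: 100190
  • Email: cjc@ict.ac.cn
  • Telephone: 010-62620695
  • International standard serial number: ISSN 0254-4164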
  • Domestic serial number: CN 11-1826/TP
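  • Postal distribution code: 2-833
  • Awards:
  • "Dual-Benefit" journal of the China Periodical Matrix
  • Indexed in domestic and international databases:
  • Mathematical Reviews, online edition (USA); Scopus abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)
  • Citation count: 48433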