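This paper proposes a query expansion method for microblog retrieval based on the time distribution of terms. The method measures the relevance between an expansion term and the query terms by the similarity of their time distributions, and builds a query model on that basis. Specifically, after defining the term time distribution and giving a method to estimate it, the paper presents a measure of the time-distribution similarity between query terms and expansion terms, uses that measure as their relevance, and thereby selects expansion terms and re-estimates the query model. Because the method expands queries using temporal information rather than content, it avoids the weakness of content-based query expansion, which cannot estimate expansion terms accurately when microblog messages are short. Experimental results on the TREC 2011 and TREC 2012 microblog retrieval evaluation data show that the query expansion model based on term time distributions effectively improves microblog retrieval performance: it is significantly better than the classical content-based query expansion model and also better than other methods that use time for query expansion.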
In microblog retrieval, content-based query expansion methods are inadequate because the relevant microblog messages are too short to provide reliable term distribution information. Most existing time-based query expansion methods exploit the time profile to shift the prior probability of relevant microblogs. In essence, these methods still cannot escape the restrictions of short texts, since the relevance between expansion terms and the query is still computed from the content of microblogs. To address this problem, this paper proposes a query expansion method based on the time distribution of terms, in which the relevance between query terms and expansion terms is measured by the similarity of their time distributions. First, the changes in term frequency across time segments are analyzed, the term time distribution is defined, and methods for estimating it are described. Then a similarity measure over term time distributions is presented to estimate the relevance between query terms and expansion terms, which determines the expansion terms used in the re-estimated query model. Two query expansion strategies are given for estimating the query expansion model from the relevance between expansion terms and the query. Finally, the term time distribution query model is obtained by integrating the query expansion model with the original query model. Using only the time profile to establish the relevance between query terms and expansion terms avoids the drawbacks that classical content-based query expansion approaches suffer from because of the length limit of microblog messages. Experiments were carried out on the TREC 2011 and TREC 2012 microblog retrieval collections. Several state-of-the-art baselines are compared with our method, including the classical language model, the content-based query expansion method, and the time-based query expansion method.
The experimental results show that the term time distribution query model outperforms the content-based as well as the time-based approaches.
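
The pipeline summarized above (estimate a time distribution per term, score candidate terms by their similarity to the query terms' distributions, then interpolate the expansion model with the original query model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the equal-width time bins, the use of cosine similarity as the similarity measure, the uniform original query model, the interpolation weight `lam`, and the function names `term_time_distribution`, `cosine`, and `expand_query`.

```python
import math


def term_time_distribution(term, docs, num_bins, t_min, t_max):
    """Estimate P(time bin | term) from the documents that contain the term.

    docs is a list of (timestamp, set_of_tokens) pairs. Equal-width bins
    over [t_min, t_max] are an illustrative assumption.
    """
    counts = [0.0] * num_bins
    width = (t_max - t_min) / num_bins
    for timestamp, tokens in docs:
        if term in tokens:
            # Clamp so that timestamp == t_max falls into the last bin.
            idx = min(int((timestamp - t_min) / width), num_bins - 1)
            counts[idx] += 1.0
    total = sum(counts)
    if total == 0:
        # Unseen term: fall back to a uniform distribution.
        return [1.0 / num_bins] * num_bins
    return [c / total for c in counts]


def cosine(p, q):
    """Cosine similarity between two distributions (an assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def expand_query(query_terms, docs, num_bins, t_min, t_max, top_k=2, lam=0.5):
    """Return a query model {term: weight} that sums to 1.

    Candidate terms are ranked by their mean time-distribution similarity
    to the query terms; the top_k candidates form the expansion model,
    which is interpolated with a uniform original query model.
    """
    original = {q: 1.0 / len(query_terms) for q in query_terms}
    q_dists = [term_time_distribution(q, docs, num_bins, t_min, t_max)
               for q in query_terms]

    vocab = {tok for _, tokens in docs for tok in tokens} - set(query_terms)
    scores = {}
    for w in vocab:
        w_dist = term_time_distribution(w, docs, num_bins, t_min, t_max)
        scores[w] = sum(cosine(w_dist, qd) for qd in q_dists) / len(q_dists)

    # Sort by descending score, breaking ties alphabetically for determinism.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    z = sum(s for _, s in top)
    if z == 0:
        # No temporally related candidate: keep the original query model.
        return original

    model = {q: (1.0 - lam) * p for q, p in original.items()}
    for w, s in top:
        model[w] = model.get(w, 0.0) + lam * s / z
    return model


if __name__ == "__main__":
    # Toy collection: "tsunami" bursts together with "quake"; "coffee" does not.
    docs = [
        (0, {"coffee", "morning"}),
        (1, {"coffee"}),
        (2, {"coffee", "morning"}),
        (7, {"quake", "tsunami"}),
        (8, {"quake", "tsunami"}),
        (9, {"quake", "alert"}),
    ]
    print(expand_query(["quake"], docs, num_bins=5, t_min=0, t_max=10, top_k=1))
```

In this toy collection the candidate whose burst coincides with the query term is selected even though the messages are very short, which is the intuition behind using time rather than content to judge relevance. A real system would likely need smoothing of the distributions and a tuned bin width, neither of which is specified here.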