The relationship between queries and documents (docs) is a valuable type of information for a search engine. Accurate analysis of the relationship between queries and docs not only helps rank search results but also serves as a bridge between them: it allows information to pass between related queries and docs, which supports deeper query understanding and doc understanding and the applications built on them. This paper presents an algorithm for mining query-doc relationships from user search behavior. We first organize and analyze the data in users' search click logs to build a bipartite graph between queries and docs. We then model the bipartite graph with a Markov random walk, mining both the click-through data and the session data it contains. The walk surfaces docs that users did not click in the click log and thereby predicts implicit relationships between queries and docs; the same algorithm also yields potential relationships between queries. On this theoretical basis, we implemented a complete log mining system. In extensive comparative experiments the system performed strongly across the aspects evaluated, improving the relevance of retrieved results by up to 71.23%. These results indicate that the proposed theory and algorithm can effectively mine implicit relationships between queries and docs, and they lay a foundation for improving the recall of search results, for query recommendation, and for clustering retrieved results.
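The pipeline summarized above (click log, bipartite graph, random walk, unclicked docs) can be illustrated with a small sketch. It is not the system described in this paper: the abstract does not give the transition probabilities, the walk length, or how session data enters the model, so the code assumes click-count-proportional transitions, a fixed number of query-doc-query round trips, and no session data. All function names and the toy log are hypothetical.

```python
from collections import defaultdict


def build_click_graph(click_log):
    """Build both directions of a query-doc bipartite graph.

    click_log is an iterable of (query, doc, click_count) tuples.
    Edge weights are summed click counts (an assumed weighting).
    """
    q2d = defaultdict(dict)
    d2q = defaultdict(dict)
    for query, doc, count in click_log:
        q2d[query][doc] = q2d[query].get(doc, 0) + count
        d2q[doc][query] = d2q[doc].get(query, 0) + count
    return q2d, d2q


def step(dist, edges):
    """Move a probability distribution one hop across the graph."""
    nxt = defaultdict(float)
    for node, mass in dist.items():
        neighbors = edges.get(node, {})
        total = sum(neighbors.values())
        for neighbor, weight in neighbors.items():
            nxt[neighbor] += mass * weight / total
    return dict(nxt)


def walk_to_docs(query, q2d, d2q, round_trips=1):
    """Doc distribution after one hop plus `round_trips` doc-query-doc trips."""
    docs = step({query: 1.0}, q2d)
    for _ in range(round_trips):
        docs = step(step(docs, d2q), q2d)
    return docs


def implicit_docs(query, q2d, d2q, round_trips=1):
    """Docs reached by the walk that the query was never clicked on, best first."""
    clicked = q2d.get(query, {})
    scores = walk_to_docs(query, q2d, d2q, round_trips)
    unclicked = [(d, s) for d, s in scores.items() if d not in clicked]
    return sorted(unclicked, key=lambda pair: -pair[1])


def related_queries(query, q2d, d2q):
    """Other queries reached by a two-hop walk (query -> doc -> query)."""
    scores = step(step({query: 1.0}, q2d), d2q)
    others = [(q, s) for q, s in scores.items() if q != query]
    return sorted(others, key=lambda pair: -pair[1])


# Toy click log (made-up data for illustration only).
toy_log = [
    ("jaguar car", "d1", 3),
    ("jaguar car", "d2", 1),
    ("jaguar price", "d2", 2),
    ("jaguar price", "d3", 2),
]
q2d, d2q = build_click_graph(toy_log)
print(implicit_docs("jaguar car", q2d, d2q))
print(related_queries("jaguar car", q2d, d2q))
```

In the toy log, "jaguar car" was never clicked on d3, but the walk reaches d3 through the shared doc d2 and the query "jaguar price", which is the kind of implicit query-doc and query-query relationship the abstract refers to. A longer walk (a larger `round_trips`) spreads probability further from the starting query, so it trades precision for coverage.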