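Based on the user click records of Tianwang, the large-scale distributed search engine system of Peking University, this paper identifies several general regularities: the number of distinct URLs clicked by users follows Heaps' law; the frequency versus rank of clicked URLs follows a Zipf-like distribution; the clicking of a URL is correlated with its page size; URL clicks exhibit temporal locality; and the click sequence exhibits self-similarity. A new and effective algorithm that uses the click log to identify related queries is also proposed. These results are of significance for understanding users' search behavior, improving the design of search engine systems, and raising the efficiency and quality of retrieval services.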
Tianwang is a large-scale search engine system that currently maintains an index of about 240 million web pages and 20 million FTP files. In this paper, we analyze the clickthrough data in the click log of Tianwang's WWW search service. The results show that the number of unique URLs selected by users conforms to Heaps' law, and that popularity versus rank for the selected URLs is well fit by a Zipf-like distribution. The frequency with which a URL is selected is correlated with its page size. URL clicks also exhibit a high degree of locality. We present a new and effective algorithm for finding the queries related to a given query. These results are important for improving the effectiveness and efficiency of the search engine system and for research on users' search behavior.
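To make the two scaling laws named above concrete, the following is a minimal sketch of how the exponents might be estimated from a click log. It is not the method used in this paper. It assumes the log has been reduced to a chronological list of clicked URLs, and it uses a plain least-squares fit on log-log axes. The function names `loglog_slope`, `zipf_exponent`, and `heaps_exponent` and the toy log are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, my - slope * mx


def zipf_exponent(clicks):
    """Estimate alpha in freq(rank) ~ C / rank**alpha (Zipf-like form)."""
    # Click counts per URL, sorted so that rank 1 is the most clicked URL.
    freqs = sorted(Counter(clicks).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    slope, _ = loglog_slope(ranks, freqs)
    return -slope


def heaps_exponent(clicks):
    """Estimate beta in unique(n) ~ K * n**beta (Heaps' law form)."""
    seen = set()
    ns, uniques = [], []
    # Track how many distinct URLs have appeared after each click.
    for n, url in enumerate(clicks, start=1):
        seen.add(url)
        ns.append(n)
        uniques.append(len(seen))
    slope, _ = loglog_slope(ns, uniques)
    return slope


# Toy log (hypothetical data, not taken from Tianwang).
toy_log = ["a", "b", "a", "c", "a", "b", "d", "a", "e", "b"]
alpha = zipf_exponent(toy_log)
beta = heaps_exponent(toy_log)
```

A straight-line fit on log-log axes is a simple first check rather than a rigorous estimator. A real log would also need preprocessing, such as URL normalization, which this sketch omits.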