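While the Web has become a platform for querying and publishing information, massive amounts of information are hidden in query-restricted Web databases, so people cannot effectively obtain these high-quality data records. Traditional Deep Web search research has mainly focused on obtaining Web database content through keyword interfaces. However, because the Deep Web is multi-attribute and returns only top-k results, keyword-based methods have inherent shortcomings, which poses challenges for Deep Web querying and retrieval. To solve this problem, a hierarchy-tree-based Deep Web data acquisition method is proposed that can extract the data records in a Web database completely and without duplication. The method first models the Web database as a hierarchy tree, so that Deep Web data acquisition becomes a tree traversal problem. Next, the attributes in the tree are sorted to shrink the traversal space; at the same time, a heuristic rule based on attribute-value correlation guides the traversal to improve its efficiency. Finally, extensive experiments on a local simulated database and on real Web databases show that the method achieves good coverage and high extraction efficiency.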
While the Web provides a platform for information search and dissemination, massive amounts of information are hidden behind query-restricted Web databases, which makes it difficult to obtain these high-quality data records. Current research on Deep Web search has focused on crawling Deep Web data via Web interfaces with keyword queries. However, these keyword-based methods have inherent limitations because of the multi-attribute and top-k features of the Deep Web. This poses a great challenge for Web information search and retrieval. To address this problem, we propose an approach for siphoning structured data based on a hierarchy tree, which can retrieve all the data in a hidden database without repetition. Firstly, we model the hidden database as a hierarchy tree. Under this theoretical framework, data retrieval is transformed into a traversal problem on the hierarchy tree. Secondly, we propose techniques to narrow the query space and obtain the attribute values by sorting the attributes in ascending order. Thirdly, we use mutual information to measure the dependency between attribute values. Based on this dependency, we narrow the traversal space by using a heuristic rule to guide the traversal process. Finally, we conduct extensive experiments over real Deep Web sites and controlled databases to illustrate the coverage and efficiency of our techniques.
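To make the tree-traversal view concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It simulates a query-restricted database that returns at most k matches, then walks the hierarchy tree depth-first: each level fixes one more attribute, and a node is expanded only when its result overflows the top-k limit. The names `make_topk_interface` and `crawl`, the overflow flag, the sample data, and the choice to order attributes by ascending domain size are all hypothetical; the abstract does not state the sort key, and the mutual-information heuristic is not shown here.

```python
def make_topk_interface(records, k):
    """Simulate a query-restricted Web database (hypothetical interface).

    Returns at most k matching records plus a flag telling whether
    more than k records matched (the top-k restriction).
    """
    def query(conditions):
        matches = [r for r in records
                   if all(r[a] == v for a, v in conditions.items())]
        return matches[:k], len(matches) > k
    return query


def crawl(query, attributes, domains):
    """Depth-first traversal of the hierarchy tree.

    A node is a partial query; its depth is the number of fixed attributes.
    Nodes whose result does not overflow are leaves of the traversal.
    """
    seen = {}      # records keyed by id, as a safeguard against duplicates
    issued = 0     # number of queries sent to the interface
    stack = [{}]   # the root is the empty query
    while stack:
        conditions = stack.pop()
        results, overflow = query(conditions)
        issued += 1
        depth = len(conditions)
        if not overflow or depth == len(attributes):
            # Either the whole subtree fits in one result page, or the
            # query is fully specified and cannot be refined any further
            # (in that case records beyond the top k stay unreachable).
            for r in results:
                seen[r["id"]] = r
            continue
        attr = attributes[depth]
        for value in domains[attr]:
            stack.append({**conditions, attr: value})
    return list(seen.values()), issued


# Hypothetical sample data: 10 records over two attributes.
makes, colors = ["audi", "bmw", "kia"], ["red", "blue"]
records = [{"id": i, "make": makes[i % 3], "color": colors[i % 2]}
           for i in range(10)]
domains = {"make": makes, "color": colors}

# Assumption: attributes are ordered by ascending domain size.
order = sorted(domains, key=lambda a: len(domains[a]))

query = make_topk_interface(records, k=3)
found, issued = crawl(query, order, domains)
print(len(found), "records retrieved with", issued, "queries")
```

Because sibling nodes fix different values of the same attribute, their result sets are disjoint, which is why a traversal of this kind can collect every record exactly once as long as no fully specified query overflows.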