A Deep Web Data Extraction Method Based on a Hierarchy Tree Model

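ISSN: 1000-1239
Journal: Journal of Computer Research and Development (计算机研究与发展)
Date: 0
Pages: 94-102
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Affiliations: [1] State Key Laboratory of Software Engineering (Wuhan University), Wuhan 430079; [2] Computer School, Wuhan University, Wuhan 430079
Funding: National Natural Science Foundation of China (60970018)
Related project: Research on Mining and Ranking User Personality in Web Communities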

Keywords: hidden database, data retrieval, multi-attribute interfaces, top-k tuple, mutual information

Chinese abstract (translated):

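While the Web has become a platform for querying and publishing information, a massive amount of information is hidden in Web databases with restricted query access, so people cannot effectively obtain these high-quality data records. Traditional Deep Web search research has focused mainly on obtaining Web database content through keyword interfaces. However, because the Deep Web is characterized by multiple attributes and top-k results, keyword-based methods have inherent shortcomings, which poses challenges for Deep Web querying and retrieval. To solve this problem, a Deep Web data acquisition method based on a hierarchy tree is proposed, which can extract the data records in a Web database completely and without duplication. The method first models the Web database as a hierarchy tree, so that the Deep Web data acquisition problem is transformed into a tree traversal problem. Next, the attributes in the tree are sorted to shrink the traversal space; at the same time, heuristic rules based on the correlation between attribute values guide the traversal and improve its efficiency. Finally, extensive experiments on locally simulated databases and real Web databases show that this method achieves good coverage and high extraction efficiency.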
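
The idea of treating the hidden database as a hierarchy tree and crawling it by traversal can be illustrated with a minimal sketch. It is not the paper's implementation, which the abstract does not give. The sketch assumes a simulated interface (`topk_query`, a hypothetical name) that returns at most `k` matching records plus an overflow flag, and it assumes that attributes are ordered by ascending number of distinct values. Each tree level fixes one more attribute; a node whose query does not overflow is fully answered, so its subtree is never expanded.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, ...]


def topk_query(db: List[Record], predicates: Dict[int, str], k: int):
    """Simulated restricted interface (assumption): at most k matches plus an overflow flag."""
    matches = [r for r in db if all(r[i] == v for i, v in predicates.items())]
    return matches[:k], len(matches) > k


def crawl(db: List[Record], domains: List[List[str]], k: int):
    """Depth-first traversal of the hierarchy tree; returns (records found, queries issued)."""
    # Assumed ordering: attributes with fewer distinct values are fixed first.
    order = sorted(range(len(domains)), key=lambda i: len(domains[i]))
    found: Set[Record] = set()
    queries = 0

    def visit(predicates: Dict[int, str], depth: int) -> None:
        nonlocal queries
        rows, overflow = topk_query(db, predicates, k)
        queries += 1
        if not overflow or depth == len(order):
            # Either the node is fully answered, or no attribute is left to refine.
            found.update(rows)
            return
        attr = order[depth]
        for value in domains[attr]:
            # Child nodes fix one more attribute; siblings cover disjoint records.
            visit({**predicates, attr: value}, depth + 1)

    visit({}, 0)
    return found, queries


if __name__ == "__main__":
    db = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x")]
    domains = [["a", "b", "c"], ["x", "y"]]
    records, issued = crawl(db, domains, k=2)
    print(len(records), "records retrieved with", issued, "queries")
```

Because sibling nodes fix different values of the same attribute, their result sets are disjoint, which is what lets a traversal of this kind collect every record without storing duplicates.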

English abstract:

While the Web provides a platform for information search and dissemination, massive amounts of information are hidden behind query-restricted Web databases, which makes it difficult to obtain these high-quality data records. Current research on Deep Web search has focused on crawling Deep Web data via Web interfaces with keyword queries. However, these keyword-based methods have inherent limitations because of the multi-attribute and top-k features of the Deep Web. This poses a great challenge for Web information search and retrieval. To address this problem, we propose an approach for siphoning structured data based on a hierarchy tree, which can retrieve all the data in the hidden databases without repetition. First, we model the hidden database as a hierarchy tree. Under this theoretical framework, data retrieval is transformed into a traversal problem on the hierarchy tree. Second, we propose techniques to narrow the query space and obtain the attribute values by sorting the attributes in ascending order. Third, we use mutual information to measure the dependency between attribute values. Based on this dependency, we narrow the traversal space by using heuristic rules to guide the traversal process. Finally, we conduct extensive experiments over real Deep Web sites and controlled databases to illustrate the coverage and efficiency of our techniques.
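
As a rough illustration of measuring dependency with mutual information, the sketch below computes the standard mutual information I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ) between two attributes from a sample of records. The abstract does not state the exact formulation used in the paper, so this is the textbook definition applied under that assumption; the function name `mutual_information` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def mutual_information(records: List[Tuple[str, ...]], i: int, j: int) -> float:
    """Empirical mutual information (in bits) between attributes i and j."""
    n = len(records)
    joint = Counter((r[i], r[j]) for r in records)  # counts of value pairs
    left = Counter(r[i] for r in records)           # marginal counts for attribute i
    right = Counter(r[j] for r in records)          # marginal counts for attribute j
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count(x) * count(y)).
        mi += (c / n) * math.log2(c * n / (left[x] * right[y]))
    return mi


if __name__ == "__main__":
    dependent = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
    independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    print(mutual_information(dependent, 0, 1))
    print(mutual_information(independent, 0, 1))
```

A high score suggests that fixing a value of one attribute strongly constrains the other, which is the kind of signal a heuristic could use to skip value combinations that are unlikely to return records.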
