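The Deep Web is an information provider hidden behind the Surface Web, and it conceals a far larger volume of information. At present, the feasible way to obtain information from the Deep Web effectively is to access it through the query interfaces it provides. Automatically extracting the attributes in a query interface and generating correct query conditions is an effective way to improve the ability to access the Deep Web. Attributes in a query interface are subject to different semantic constraints, such as mutual exclusion and co-occurrence. To generate valid query conditions, the semantic relations among the key attributes must be discovered and reconciled. To address these problems, we propose an automatic form attribute extraction method that is based on ontology techniques and makes full use of instance information; the method uses WordNet to enrich the extracted key attributes and to discover the semantic relations among the attributes in a form. During extraction, each attribute is expanded into a candidate attribute set stored as a tree data structure, and the candidate attribute tree can effectively describe the semantic relations among attributes. Experiments in real-world domains show that the framework can automatically extract Deep Web form attributes and effectively generate query conditions.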
The Deep Web lies behind the Surface Web and holds more information. Search engines and web crawlers cannot access the Deep Web directly; the only workable way to reach a hidden database is through its query interface. Automatically extracting attributes from the query interface and translating queries is a feasible way to address the current limitations in accessing Deep Web data sources. A query interface imposes semantic constraints: some attributes co-occur, while others are mutually exclusive. To generate a valid query, we have to reconcile the key attributes and the semantic relations between them. We design a framework that automatically extracts the attributes from the query interface, taking full advantage of instance information, and uses WordNet as an ontology resource to enrich the attributes embedded in the query interface. Each attribute is extended into a candidate attribute set in the form of a hierarchy tree. We carry out experiments in real-world domains. The results demonstrate the validity of the query translation framework.
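As an illustration only, the following minimal Python sketch shows one way a candidate attribute tree and the two kinds of semantic constraints could be represented. It is not the framework evaluated in this work. The names `AttributeNode`, `build_candidate_tree`, `match_attribute`, and `validate_query` are hypothetical, and the small `TOY_SYNONYMS` dictionary is an assumed stand-in for real WordNet lookups.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: a tiny hand-written table standing in for WordNet lookups.
TOY_SYNONYMS = {
    "author": ["writer", "creator"],
    "title": ["name", "heading"],
}


@dataclass
class AttributeNode:
    """One node of a candidate attribute tree (hypothetical structure)."""
    term: str
    children: list[AttributeNode] = field(default_factory=list)

    def terms(self) -> list[str]:
        """Return every term in this subtree, root first."""
        found = [self.term]
        for child in self.children:
            found.extend(child.terms())
        return found


def build_candidate_tree(label: str, synonyms=TOY_SYNONYMS) -> AttributeNode:
    """Expand a form attribute label into a tree of candidate terms."""
    root = AttributeNode(label)
    for related in synonyms.get(label, []):
        root.children.append(AttributeNode(related))
    return root


def match_attribute(user_term: str, trees: dict[str, AttributeNode]):
    """Return the form attribute whose candidate tree contains user_term."""
    for label, tree in trees.items():
        if user_term.lower() in tree.terms():
            return label
    return None


def validate_query(selected, exclusive_groups, cooccur_groups) -> list[str]:
    """List constraint violations for a set of selected attributes."""
    selected = set(selected)
    problems = []
    # Exclusive attributes: at most one member of each group may be used.
    for group in exclusive_groups:
        chosen = selected & set(group)
        if len(chosen) > 1:
            problems.append(f"exclusive: {sorted(chosen)}")
    # Co-occurring attributes: if one member is used, all must be used.
    for group in cooccur_groups:
        chosen = selected & set(group)
        if chosen and chosen != set(group):
            missing = set(group) - chosen
            problems.append(f"missing co-occurring: {sorted(missing)}")
    return problems
```

Under these assumptions, building trees for `author` and `title` lets `match_attribute("Writer", trees)` resolve a user term to the form attribute `author`, while `validate_query` reports a problem when two exclusive attributes are filled in together or when only part of a co-occurring group is supplied. A real system would replace the toy dictionary with WordNet synsets and hypernyms and would derive the constraint groups from the form itself.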