Research on Deep Web Information Extraction Based on Templates and Domain Ontology

ISSN: 1000-7024
Journal: Computer Engineering and Design (《计算机工程与设计》)
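Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Jiangsu Network Monitoring Center, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044; [2] School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044
Funding: National Natural Science Foundation of China (61103142)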

Keywords: Deep Web, information extraction, template, domain ontology, decision tree

Abstract (translated from the Chinese):

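To simplify template extraction rules and improve extraction accuracy, a Deep Web information extraction method based on dual templates and a domain ontology is proposed. The method combines a DIV block template with a table template to form the dual template. Working from web pages preprocessed with Chinese word segmentation, and guided by domain ontology knowledge, it trains a classification model with the C4.5 decision tree algorithm, selects the sequence numbers of the DIV blocks to be extracted, and builds the DIV block template, which allows the data block to be located precisely. XML technology is then used to construct an XSLT document that supplies the extraction rules of the table template, and the data fragments are extracted with these rules. Deep Web information extraction experiments in the weather domain show that both the accuracy rate and the recall rate can exceed 95%, indicating a good extraction result.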
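The abstract names C4.5 as the algorithm that decides which DIV blocks hold data. C4.5 chooses its splits by gain ratio, and the following is a minimal sketch of that criterion only, not the authors' implementation. The feature names (`has_ontology_term`, `has_table`) and the four toy blocks are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of splitting on `feature`, the criterion C4.5 uses."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises features with many small partitions.
    split_info = -sum((len(g) / total) * math.log2(len(g) / total)
                      for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical DIV-block features; the paper's real feature set is not given.
blocks = [
    {"has_ontology_term": True, "has_table": True},
    {"has_ontology_term": True, "has_table": False},
    {"has_ontology_term": False, "has_table": True},
    {"has_ontology_term": False, "has_table": False},
]
is_data_block = [True, True, False, False]

# The feature with the highest gain ratio becomes the root test of the tree.
best_feature = max(blocks[0],
                   key=lambda f: gain_ratio(blocks, is_data_block, f))
```

On this toy data the ontology-term feature separates the two classes completely, so it would be selected as the first split. A full C4.5 tree would apply the same criterion recursively and add handling for continuous attributes and pruning, which this sketch omits.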

English abstract:

To simplify the extraction rules of the template and improve extraction accuracy, a method based on templates and a domain ontology is presented for extracting Deep Web information. A DIV block template and a table template are used in combination. Using the result of web page preprocessing based on Chinese word segmentation, and under the guidance of domain ontology knowledge, a classifier is trained with the C4.5 decision tree algorithm to select the sequence numbers of the DIV blocks to be extracted, and the DIV block template is built so that the data area can be located. An XSLT document is then constructed using XML technology to form the table template, which is used to extract the data fragments. The results of a Deep Web information extraction experiment in the weather domain show that the average accuracy rate and recall rate can both exceed 95%, and that a good extraction effect is obtained.
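The two-stage idea in the abstract (locate a DIV block by its sequence number, then apply table rules inside it) might be sketched as follows. The paper builds its table template as an XSLT document. This sketch swaps in path queries from Python's standard `xml.etree.ElementTree` instead, so it illustrates the flow rather than the authors' rules. The page snippet, the column names, and the helper `extract_records` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page from a weather site.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="forecast">
  <table>
    <tr><th>Date</th><th>Weather</th><th>High</th></tr>
    <tr><td>Mon</td><td>Sunny</td><td>21</td></tr>
    <tr><td>Tue</td><td>Rain</td><td>17</td></tr>
  </table>
</div>
</body></html>"""


def extract_records(page, div_index):
    """Return the table rows inside the DIV at `div_index` as dictionaries."""
    root = ET.fromstring(page)
    # Stage 1 (DIV block template): pick the block by its sequence number,
    # counted in document order.
    divs = root.findall(".//div")
    table = divs[div_index].find(".//table")
    # Stage 2 (table template): the header row names the fields and each
    # following row is one record.
    rows = table.findall("tr")
    header = [cell.text.strip() for cell in rows[0].findall("th")]
    return [
        dict(zip(header, (cell.text.strip() for cell in row.findall("td"))))
        for row in rows[1:]
    ]


# Index 1 is assumed to be the block a trained classifier would select.
records = extract_records(PAGE, 1)
```

Real Deep Web pages are rarely well-formed XML, so a practical version would need an HTML-tolerant parser before either stage could run.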