为简化模板的抽取规则、提高抽取的准确率,提出了一种基于双模板和领域本体的Deep Web信息抽取方法。该方法采用DIV块模板和表格模板结合的方法,建立双模板。利用基于中文分词的网页预处理结果,在领域本体知识的指导下,通过C4.5决策树算法来训练分类模型,筛选出待抽取的DIV块序号,构建DIV块模板,从而可以精确定位到数据块。利用XML技术构建XSLT文档,得到表格模板的抽取规则,从而抽取出数据片段。选取天气领域进行Deep Web信息抽取实验,实验结果表明,抽取准确率和召回率都可以达到95%以上,取得了较好的抽取效果。
To simplify template extraction rules and improve extraction accuracy, a Deep Web information extraction method based on dual templates and a domain ontology is proposed. The method combines a DIV block template with a table template to form the dual template. Using the results of web page preprocessing based on Chinese word segmentation, and guided by domain ontology knowledge, a classification model is trained with the C4.5 decision tree algorithm to select the indices of the DIV blocks to be extracted. These indices form the DIV block template, which allows the data block to be located precisely. An XSLT document is then constructed using XML technology to obtain the extraction rules of the table template, with which the data fragments are extracted. Deep Web information extraction experiments in the weather domain show that the average accuracy and recall can both exceed 95%, indicating good extraction performance.
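As an illustration of the classification step, the following minimal sketch computes the gain ratio, the attribute selection criterion used by C4.5, over a toy set of DIV blocks. The feature names (`has_ontology_term`, `has_table`), the labels, and the data are hypothetical and are not taken from the paper. The sketch covers only the choice of the first split, not a full decision tree with pruning or continuous attributes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of a discrete feature: information gain / split information."""
    total = len(rows)
    # Partition the labels by the value each row takes for this feature.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Expected entropy remaining after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    # Entropy of the partition sizes themselves.
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    if split_info == 0:  # the feature takes a single value and cannot split the data
        return 0.0
    return (entropy(labels) - remainder) / split_info


# Hypothetical training data: one row per DIV block of a preprocessed page.
blocks = [
    {"has_ontology_term": True,  "has_table": True},
    {"has_ontology_term": True,  "has_table": False},
    {"has_ontology_term": False, "has_table": True},
    {"has_ontology_term": False, "has_table": False},
]
# Hypothetical labels: True if the block should be extracted.
is_data_block = [True, True, False, False]

# C4.5 splits first on the feature with the highest gain ratio.
best = max(blocks[0], key=lambda f: gain_ratio(blocks, is_data_block, f))
```

In this toy data the presence of a domain ontology term separates data blocks from the other blocks perfectly, so it is chosen as the first split, while `has_table` carries no information about the label.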