Based on the characteristics of the information that describes things in Web pages, this paper proposes an ontology-guided method for Web page information extraction. An ontology model of the extraction target is built first, and a location-information mapping model is attached to the property concepts of the ontology. The mapping model is used to locate and separate the data blocks in a sample page that carry semantic information, and a path analysis algorithm then generates extraction rules from those blocks. The rules are applied to pages of the same type to extract the descriptive information, which is finally stored in Resource Description Framework (RDF) format. Extraction performance experiments show that the results have high accuracy and that the method is more efficient than rule-free extraction.
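The pipeline summarized above (mapping model, rule generation from a sample page, rule application to similar pages, RDF output) can be illustrated with a minimal sketch. It is not the paper's algorithm: the mapping model is reduced to a hypothetical dictionary from property names to label text, the "path" is a simple list of child indices, the pages are assumed to be well-formed markup that `xml.etree.ElementTree` can parse, and the `http://example.org/` IRIs are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping model: ontology property -> label text that locates it.
MAPPING = {"name": "Name", "price": "Price"}

# Toy sample page and a second page with the same layout (assumed well-formed).
SAMPLE = ("<html><body><table>"
          "<tr><td>Name</td><td>Widget A</td></tr>"
          "<tr><td>Price</td><td>9.50</td></tr>"
          "</table></body></html>")
TARGET = ("<html><body><table>"
          "<tr><td>Name</td><td>Widget B</td></tr>"
          "<tr><td>Price</td><td>12.00</td></tr>"
          "</table></body></html>")


def find_value_path(root, label):
    """Return the child-index path from root to the sibling after the label element."""
    def walk(node, path):
        children = list(node)
        for i, child in enumerate(children):
            if (child.text or "").strip() == label and i + 1 < len(children):
                return path + [i + 1]
            found = walk(child, path + [i])
            if found is not None:
                return found
        return None
    return walk(root, [])


def learn_rules(sample_html, mapping):
    """Build one path rule per property from the sample page."""
    root = ET.fromstring(sample_html)
    rules = {}
    for prop, label in mapping.items():
        path = find_value_path(root, label)
        if path is not None:
            rules[prop] = path
    return rules


def apply_rules(page_html, rules):
    """Follow each stored path on a page of the same type and collect the text."""
    root = ET.fromstring(page_html)
    record = {}
    for prop, path in rules.items():
        node = root
        try:
            for index in path:
                node = node[index]
        except IndexError:
            continue  # layout differs from the sample; skip this property
        record[prop] = (node.text or "").strip()
    return record


def to_ntriples(subject_iri, record, vocab="http://example.org/onto#"):
    """Serialize the record as N-Triples lines with plain string literals."""
    lines = []
    for prop, value in record.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_iri}> <{vocab}{prop}> "{escaped}" .')
    return lines


rules = learn_rules(SAMPLE, MAPPING)
record = apply_rules(TARGET, rules)
triples = to_ntriples("http://example.org/item/2", record)
```

Learning the path once and replaying it on every similar page is what would make a rule-based approach cheaper than searching each page for labels from scratch, which is the efficiency argument the abstract makes.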