Semantic Annotation of Web Data Based on Constrained Conditional Random Fields

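ISSN: 1000-1239
Journal: 计算机研究与发展 (Journal of Computer Research and Development)
Date: 2012.2
Pages: 361-371
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, Shandong University, Jinan 250100; [2] School of Computer Science and Technology, Xuzhou Normal University, Xuzhou, Jiangsu 221116
Funding: National Natural Science Foundation of China (61100167, 90818001); Natural Science Foundation of Jiangsu Province (BK2011204); Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJB520019); Natural Science Foundation of Shandong Province (Y2007G24)
Related project: Research on Key Technologies for Extracting and Integrating Query Results in Deep Web Data Integration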

Keywords: semantic annotation, Web information extraction, conditional random field, integer linear programming, Web data integration

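Abstract (translated from the Chinese):

Semantic annotation of Web data is a key step in Web information extraction. Conditional random fields are a classic method that exploits sequence features to solve sequence labeling problems. However, existing conditional random field models cannot jointly exploit existing Web database information and the logical relationships among Web data elements, which results in low accuracy for semantic annotation of Web data. This paper therefore proposes a constrained conditional random field (CCRF) model. By introducing confidence constraints and logical constraints, the model makes effective use of existing Web database information and of the logical relationships among Web data elements. Because the Viterbi inference method of existing conditional random field models cannot jointly exploit these two kinds of constraints, the model adopts an integer linear programming inference method that brings both kinds of constraints into the inference process. Experiments on real-world data sets from multiple domains show that the proposed model significantly improves the performance of semantic annotation of Web data and lays a good foundation for Web information extraction.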

English abstract:

Semantic annotation of Web data is a key step in Web information extraction. Its goal is to assign meaningful semantic labels to the data elements of an extracted Web object, and it has attracted increasing research attention in recent years. Conditional random fields are a state-of-the-art approach that exploits sequence characteristics to improve labeling. However, traditional conditional random fields cannot simultaneously use existing Web databases and the logical relationships among Web data elements, which leads to low precision in semantic annotation of Web data. To solve this problem, this paper presents a constrained conditional random fields (CCRF) model for annotating Web data. The model incorporates confidence constraints and logical constraints to make efficient use of existing Web databases and of the logical relationships among Web data elements. Because the Viterbi inference approach of the traditional CRF model cannot utilize both kinds of constraints at once, the model adopts a novel inference procedure based on integer linear programming, extending the CRF to support both kinds of constraints naturally and efficiently. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach significantly improves the accuracy of semantic annotation of Web data and lays a solid foundation for Web information extraction.
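The idea of decoding under two kinds of constraints can be illustrated with a small sketch. This record does not give the paper's actual formulation, so everything below is an assumption made for illustration: the label set, the scores, the use of a fixed label at a position as a stand-in for a confidence constraint, and an "at most once" rule as a stand-in for a logical constraint. The search is a plain exhaustive enumeration over label sequences, used here in place of an integer linear programming solver; it optimizes the same kind of objective but does not scale beyond toy inputs.

```python
from itertools import product

# Hypothetical label set for the data elements of one extracted Web object.
LABELS = ["title", "author", "price"]


def sequence_score(labels, emission, transition):
    """Sum of per-position emission scores and pairwise transition scores."""
    score = sum(emission[i][lab] for i, lab in enumerate(labels))
    score += sum(transition.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return score


def constrained_decode(emission, transition, fixed=None, at_most_once=()):
    """Return the best-scoring label sequence that satisfies the constraints.

    fixed: {position: label}, an assumed stand-in for a confidence
        constraint (e.g. the element matched an entry in a known database).
    at_most_once: labels allowed to appear at most once, an assumed
        stand-in for a logical constraint among data elements.
    """
    fixed = fixed or {}
    best, best_score = None, float("-inf")
    # Exhaustive enumeration replaces the ILP solver in this toy setting.
    for labels in product(LABELS, repeat=len(emission)):
        if any(labels[i] != lab for i, lab in fixed.items()):
            continue  # violates a fixed-label constraint
        if any(labels.count(lab) > 1 for lab in at_most_once):
            continue  # violates an at-most-once constraint
        score = sequence_score(labels, emission, transition)
        if score > best_score:
            best, best_score = labels, score
    return best


# Made-up scores for three data elements (not taken from the paper).
emission = [
    {"title": 2.0, "author": 0.5, "price": 0.1},
    {"title": 1.5, "author": 1.2, "price": 0.1},
    {"title": 0.2, "author": 0.3, "price": 0.25},
]
transition = {("title", "author"): 0.3}

unconstrained = constrained_decode(emission, transition)
constrained = constrained_decode(
    emission, transition, fixed={2: "price"}, at_most_once=("title",)
)
```

With these made-up scores, the unconstrained search labels two elements as "title", while the constrained search is forced to a sequence that uses "title" once and respects the fixed "price" label. In a real system the enumeration would be replaced by an ILP solver over 0-1 indicator variables, which is the kind of inference the abstract describes.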
