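Semantic annotation of Web data is a key step in Web information extraction. Conditional random fields (CRFs) are a classic method that exploits sequence features to solve sequence labeling problems. However, existing CRF models cannot jointly exploit existing Web database information and the logical relationships among Web data elements, which results in low accuracy for semantic annotation of Web data. This paper therefore proposes a constrained conditional random field (CCRF) model. By introducing confidence constraints and logical constraints, the model makes effective use of existing Web database information and of the logical relationships among Web data elements. Because the Viterbi inference method used by existing CRF models cannot exploit these two kinds of constraints together, the model adopts an integer linear programming inference method that brings both kinds of constraints into the inference process. Experimental results on real-world data sets from multiple domains show that the proposed model significantly improves the performance of Web data semantic annotation and lays a good foundation for Web information extraction.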
Semantic annotation of Web data is a key step in Web information extraction. Its goal is to assign meaningful semantic labels to the data elements of an extracted Web object. The task has attracted increasing research attention in recent years. Conditional random fields (CRFs) are a state-of-the-art approach that exploits sequence characteristics to improve labeling. However, traditional CRFs cannot simultaneously use existing Web databases and the logical relationships among Web data elements, which leads to low precision in semantic annotation of Web data. To address this problem, this paper presents a constrained conditional random field (CCRF) model for annotating Web data. The model incorporates confidence constraints and logical constraints so that it can make effective use of existing Web databases and of the logical relationships among Web data elements. Because the Viterbi inference procedure of the traditional CRF model cannot apply the two kinds of constraints at the same time, the model adopts a novel inference procedure based on integer linear programming, which extends the CRF to support both kinds of constraints naturally and efficiently. Experiments on a large body of real-world data collected from diverse domains show that the proposed approach significantly improves the accuracy of semantic annotation of Web data and lays a solid foundation for Web information extraction.
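
To illustrate what inference under the two kinds of constraints means, the following minimal sketch decodes a tiny linear-chain model while restricting the search to label sequences that satisfy every constraint. It is not the paper's method: exhaustive enumeration stands in for the integer linear programming solver (it optimizes the same objective over the same feasible set, but only scales to toy inputs). The label set, the scores, and the way each constraint is encoded are all assumptions made for illustration. Here a confidence constraint is modeled as a label fixed because the element was matched in an existing Web database, and a logical constraint as a predicate over the whole label sequence.

```python
from itertools import product

# Hypothetical label set for a book record (not taken from the paper).
LABELS = ["title", "author", "price"]

# Illustrative per-element scores (log-potentials) for three data elements.
EMISSION = [
    {"title": 2.0, "author": 1.0, "price": 0.0},
    {"title": 1.5, "author": 1.0, "price": 0.0},
    {"title": 0.0, "author": 0.2, "price": 1.0},
]
# Uniform transition scores keep the arithmetic easy to follow.
TRANSITION = {(a, b): 0.0 for a in LABELS for b in LABELS}


def sequence_score(labels, emission, transition):
    """Linear-chain score: per-element scores plus adjacent-pair scores."""
    score = sum(emission[i][y] for i, y in enumerate(labels))
    score += sum(transition[(a, b)] for a, b in zip(labels, labels[1:]))
    return score


def constrained_decode(emission, transition, fixed=None, constraints=()):
    """Return the best-scoring label sequence that satisfies all constraints.

    fixed:       {position: label}, an assumed encoding of confidence constraints.
    constraints: predicates over the full sequence, an assumed encoding of
                 logical constraints.
    """
    fixed = fixed or {}
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=len(emission)):
        if any(labels[i] != y for i, y in fixed.items()):
            continue  # violates a confidence constraint
        if not all(check(labels) for check in constraints):
            continue  # violates a logical constraint
        score = sequence_score(labels, emission, transition)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score


def at_most_one_title(labels):
    """Example logical constraint: a record has at most one title."""
    return labels.count("title") <= 1


if __name__ == "__main__":
    print(constrained_decode(EMISSION, TRANSITION))
    print(constrained_decode(EMISSION, TRANSITION,
                             constraints=[at_most_one_title]))
    print(constrained_decode(EMISSION, TRANSITION, fixed={1: "title"},
                             constraints=[at_most_one_title]))
```

Without constraints the decoder labels the first two elements as titles. Adding the logical constraint forces one of them to change, and additionally fixing the second element as a title changes the label of the first. A global predicate such as `at_most_one_title` depends on the whole sequence, which is why first-order Viterbi cannot enforce it directly. In an ILP formulation the same restrictions would appear as linear constraints over binary label-indicator variables.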