属性抽取可分为对齐和语义标注两个过程,现有对齐方法中部分含有相同标签不同语义的属性会错分到同一个组,而且为了提高语义标注的精度,通常需要大量的人工标注训练集.为此,文中提出结合主动学习的多记录网页属性抽取方法.针对属性错分问题,引入属性的浅层语义,减少相同标签语义不一致的影响.在语义标注阶段,基于网页的文本、视觉和全局特征,采用基于主动学习的SVM分类方法获得带有语义的结构化数据.同时在主动学习的策略选择方面,通过引入样本整体信息,构建基于不确定性度量的策略,选择语义分类预测不准的样本进行标注.实验表明,在论坛、微博等多个数据集上,相比现有方法,文中方法抽取效果更好.
The attribute extraction process can be separated into two phases, alignment and annotation. In the existing alignment methods, different semantic attributes are mistakenly aligned into the same group. Furthermore, to improve the accuracy of semantic annotation, time-consuming manual annotation is oftenintroduced to construct training set. To solve this problem, a multi-record webpage attribute extraction method combining active learning is presented. As for the problem of wrong attribute alignment, shallow semantic is integrated into the alignment approach to relieve the influence of same tags with different semantics. In the semantic annotation phase, textual, visual and global features are extracted for semantic classification and an active learning based SVM classifier is applied to extract structural data. Moreover, a new sample selection strategy is proposed by introducing the global sample information, and more informative samples with lower confidences are selected to be labeled. The experimental results on BBS and microblog datasets confirm the superiority the proposed method.