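Attributes and attribute values of a given entity are extracted from unstructured text by treating attribute extraction as a sequence labeling problem. To avoid manually annotating a training corpus, the structured content already present in Baidu Encyclopedia Infoboxes is exploited: it is labeled back onto the unstructured text to generate training data automatically. Once the training corpus is obtained, multidimensional features suited to the characteristics of Chinese are selected to train a sequence labeling model, and contextual information is used to further improve system performance, so that entity attributes and their values can be extracted from unstructured text. Experimental results show that the method is effective across multiple categories of Baidu Encyclopedia; moreover, it can be extended directly to attribute extraction from similar unstructured text.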
An approach is proposed for extracting the attribute-value pairs of a given entity from unstructured text, treating attribute-value extraction as a sequence labeling problem. To avoid labeling the corpus manually, the information in the Infoboxes of Baidu Encyclopedia is used to label the unstructured text automatically, which yields the training data. Once the training data has been generated, multidimensional features are used to train the sequence labeling model, and performance is further improved by using contextual information. Experiments show that the method is effective across many categories of Baidu Encyclopedia and that it can also be applied to similar unstructured text on other websites.
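
The back-labeling step described above can be illustrated with a minimal sketch. It assumes exact string matching of Infobox values against the text and a character-level BIO tag scheme; the function name `back_label`, the attribute keys, and the example sentence are hypothetical and are not taken from the paper, whose actual matching rules and tag set may differ.

```python
def back_label(sentence, infobox):
    """Return one BIO tag per character of `sentence`.

    `infobox` maps attribute names to attribute values (both strings).
    Assumption: a value is matched only when it appears verbatim in the text.
    """
    tags = ["O"] * len(sentence)
    # Try longer values first so a short value cannot claim part of a longer one.
    for attr, value in sorted(infobox.items(), key=lambda kv: -len(kv[1])):
        if not value:
            continue
        start = sentence.find(value)
        while start != -1:
            end = start + len(value)
            # Only label spans that no earlier value has already claimed.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + attr
                for i in range(start + 1, end):
                    tags[i] = "I-" + attr
            start = sentence.find(value, end)
    return tags


# Hypothetical Infobox entries and sentence, for illustration only.
example_infobox = {"birth_date": "701年", "occupation": "诗人"}
example_sentence = "李白生于701年,是唐代诗人。"
example_tags = back_label(example_sentence, example_infobox)
for char, tag in zip(example_sentence, example_tags):
    print(char, tag)
```

Sentences labeled this way could then serve as training data for a sequence labeling model without manual annotation. Exact matching is the simplest choice here; it misses paraphrased values and may mislabel coincidental matches, so a real system would likely need additional filtering.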