Research on the Distance Feature in Anaphora Resolution

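ISSN: 1003-0077
Journal: 中文信息学报 (Journal of Chinese Information Processing)
Date: 0
Pages: 39-44
Language: Chinese
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science and Technology, Soochow University; Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Suzhou, Jiangsu 215006
Funding: National Natural Science Foundation of China (60673041); National 863 Program (2006AA01Z147)
Related project: Research on key technologies for high-performance adaptive information extraction based on machine learning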

Keywords: computer application, Chinese information processing, anaphora resolution, machine learning, distance feature (distance information), maximum entropy classifier (maximum entropy model), SVM classifier (SVM model)

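Chinese abstract (translated):

Anaphora resolution is an important problem in natural language processing; it covers the identification of anaphoric reference for proper nouns, common nouns, and pronouns. This paper implements a machine-learning-based platform for anaphora resolution of English noun phrases. After a series of preprocessing steps on the raw corpus, including named entity recognition and noun phrase identification, a number of effective features and feature combinations are selected, and two classification algorithms, maximum entropy and SVM, are used to classify the noun phrases. On this basis, the paper focuses on how the distance feature affects anaphora resolution. In traditional machine-learning approaches to anaphora resolution, the distance between the candidate and the antecedent is defined as a feature, but its role in generating training instances is not considered. This paper studies the distance in detail from two angles: adding it as a feature to the learning algorithm, and using it as a constraint when generating candidate coreference instances. Evaluation on the MUC-6 benchmark corpus shows that proper use of the distance feature can greatly improve system performance. In the end, the maximum entropy and SVM classifiers achieve F1 scores of 67.5 and 68.7 on the test set, respectively, which is better than other systems of the same type.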

English abstract:

Anaphora resolution plays an important role in natural language processing and involves the recognition of named-entity, nominal-phrase, and pronoun anaphora, among others. This paper presents a machine learning approach to anaphora resolution with a special focus on the distance information between the anaphor and the antecedent candidate. Traditionally, the distance between anaphor and candidate is adopted only as a feature in machine learning approaches, without taking into account its contribution to antecedent candidate generation. In this paper, the distance information is explored in detail by either incorporating it as a feature in the learning algorithm (such as the maximum entropy model and the SVM model) or applying it as a hard constraint in antecedent candidate generation. Evaluation on the MUC-6 benchmark corpus shows that proper handling of the distance information can substantially improve performance, and our system achieves an F1-measure of 68.7, which outperforms other similar systems.
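
The two uses of distance described in the abstracts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` and `Instance` classes, the `generate_instances` function, the `max_sentence_distance` parameter, and the choice of sentences as the distance unit are all assumptions made for illustration, since the abstracts do not specify them. The sketch pairs each noun phrase with the noun phrases that precede it, records the distance as a feature on every pair, and optionally drops pairs that exceed a distance limit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A noun phrase; entity_id is a hypothetical gold coreference chain id."""
    text: str
    sentence: int
    entity_id: Optional[int] = None


@dataclass(frozen=True)
class Instance:
    """One (anaphor, antecedent candidate) pair handed to a classifier."""
    anaphor: Mention
    candidate: Mention
    sentence_distance: int  # distance used as a feature
    label: bool             # True if the two mentions corefer


def generate_instances(
    mentions: List[Mention],
    max_sentence_distance: Optional[int] = None,
) -> List[Instance]:
    """Pair each mention with every preceding mention (document order assumed).

    If max_sentence_distance is given, it acts as a hard constraint:
    pairs further apart than the limit are never generated.
    """
    instances = []
    for j, anaphor in enumerate(mentions):
        for candidate in mentions[:j]:
            distance = anaphor.sentence - candidate.sentence
            if max_sentence_distance is not None and distance > max_sentence_distance:
                continue  # distance as a constraint on instance generation
            label = (
                anaphor.entity_id is not None
                and anaphor.entity_id == candidate.entity_id
            )
            instances.append(Instance(anaphor, candidate, distance, label))
    return instances


# Toy document: four mentions in document order, two coreference chains.
mentions = [
    Mention("Bill Clinton", sentence=0, entity_id=1),
    Mention("the company", sentence=1, entity_id=2),
    Mention("he", sentence=3, entity_id=1),
    Mention("it", sentence=4, entity_id=2),
]

unconstrained = generate_instances(mentions)
constrained = generate_instances(mentions, max_sentence_distance=3)
print(len(unconstrained), len(constrained))
```

In this toy example the limit of 3 sentences removes one negative pair and keeps both positive pairs. A tighter limit would also start discarding true antecedents, which is the kind of trade-off a study of the distance constraint has to measure.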
