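Anaphora resolution is an important problem in natural language processing; it covers the resolution of proper nouns, common nouns, and pronouns. This paper implements a machine-learning-based anaphora resolution platform for English noun phrases. After a series of preprocessing steps on the raw corpus, including named entity recognition and noun phrase identification, we select a number of effective features and feature combinations and classify noun phrases with two classification algorithms, maximum entropy and SVM. On this basis, we focus on how distance affects anaphora resolution. Traditional machine learning approaches define the distance between the candidate and the antecedent as a feature, but do not consider the role distance plays in generating training instances. We study distance in detail from two angles: adding the distance between the candidate and the antecedent as a feature in the learning algorithm, and using it as a constraint when generating candidate anaphoric instances. Evaluation on the MUC-6 benchmark corpus shows that making proper use of distance substantially improves system performance. With the maximum entropy and SVM classifiers, the system obtains F1 values of 67.5 and 68.7 on the test set, respectively, which is better than other systems of the same type.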
Anaphora resolution plays an important role in natural language processing and covers the resolution of named entities, nominal phrases, and pronouns. This paper presents a machine learning approach to anaphora resolution, with a special focus on the distance between the anaphor and the antecedent candidate. Traditionally, machine learning approaches use this distance only as a feature and do not take into account its contribution to antecedent candidate generation. In this paper, the distance information is explored in detail in two ways: by incorporating it as a feature in the learning algorithm (a maximum entropy model or an SVM model) and by applying it as a hard constraint in antecedent candidate generation. Evaluation on the MUC-6 benchmark corpus shows that proper handling of the distance information substantially improves performance. Our system achieves an F1-measure of 68.7, which outperforms other similar systems.
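The two uses of distance described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The `Mention` record, the `generate_instances` function, the `max_sentence_distance` parameter, and the choice of sentence count as the distance unit are all assumptions made for illustration. The sketch pairs each anaphor with preceding mentions, records the distance as a feature, and optionally discards pairs beyond a distance limit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; field names are illustrative only."""
    text: str
    sentence: int          # index of the sentence containing the mention
    chain: Optional[int]   # gold chain id, or None if the mention is a singleton


Instance = Tuple[str, str, Dict[str, int], bool]


def generate_instances(
    mentions: List[Mention],
    max_sentence_distance: Optional[int] = None,
) -> List[Instance]:
    """Pair each anaphor with preceding candidates.

    Assumes `mentions` is sorted in document order. The sentence distance is
    always emitted as a feature; if `max_sentence_distance` is given, it also
    acts as a hard constraint on which pairs are generated.
    """
    instances: List[Instance] = []
    for j, anaphor in enumerate(mentions):
        # Walk backwards from the closest preceding mention.
        for i in range(j - 1, -1, -1):
            candidate = mentions[i]
            distance = anaphor.sentence - candidate.sentence
            if max_sentence_distance is not None and distance > max_sentence_distance:
                break  # earlier mentions are at least as far away
            label = candidate.chain is not None and candidate.chain == anaphor.chain
            features = {"sentence_distance": distance}
            instances.append((candidate.text, anaphor.text, features, label))
    return instances


mentions = [
    Mention("Bill Clinton", 0, 1),
    Mention("the president", 1, 1),
    Mention("a bill", 3, None),
    Mention("he", 4, 1),
]
unconstrained = generate_instances(mentions)
constrained = generate_instances(mentions, max_sentence_distance=2)
```

In this toy input the constraint removes distant pairs, including some positive ones, so the limit trades recall of true antecedents against the number of negative instances the classifier has to handle.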