To address the incorrect segmentation of long terms (terms containing many characters) in domain term extraction, this paper proposes a statistical domain term extraction method based on term length and grammatical features. When machine learning is used to extract candidate terms, the method adds constraint rules based on term length and grammatical features. When a statistical method is used to determine the domain specificity of candidate terms, the word-length ratio is taken fully into account and serves as an important weight in judging a term's domain specificity. Experiments show that the proposed method can correctly extract long domain terms, and that its precision and recall improve on those of previous methods.