Automatic extraction of technical terms is an important research topic in Chinese information processing, with wide application in information retrieval and machine translation, and in patent translation in particular. Motivated by the patent translation task, this paper studies methods for recognizing technical terms in patents. Building on an analysis of existing methods, it proposes a recognition method that first uses a Conditional Random Fields (CRF) model to label and recognize terms, then applies a rule-based post-processing step to correct wrongly labeled results. Experimental results show that the proposed combination of statistics and rules is effective for identifying technical terms: the F-value reaches 84.4% in the open test.
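The two-stage pipeline and the F-value can be illustrated with a minimal sketch. The abstract does not state the tag set, the rules, or the scoring granularity, so everything here is an assumption: a B/I/O tagging scheme, a single hypothetical rule (`trim_rule`) that strips function words from term boundaries, and exact-match span scoring. The CRF itself is not implemented; a hand-written tag sequence stands in for its output.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # [start, end) token offsets of one term


def decode_bio(tags: List[str]) -> Set[Span]:
    """Collect term spans from a B/I/O tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing term
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            start = i  # tolerate an I without a preceding B
    return spans


def trim_rule(tokens: List[str], spans: Set[Span],
              bad_edges=frozenset({"的", "和"})) -> Set[Span]:
    """Hypothetical post-processing rule: a term may not begin or end
    with a function word, so such tokens are trimmed from both edges."""
    fixed: Set[Span] = set()
    for start, end in spans:
        while start < end and tokens[start] in bad_edges:
            start += 1
        while end > start and tokens[end - 1] in bad_edges:
            end -= 1
        if start < end:  # drop spans that were nothing but function words
            fixed.add((start, end))
    return fixed


def prf(pred: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-value over exactly matching spans."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Illustrative data only, not taken from the paper's corpus.
tokens = ["基于", "条件", "随机场", "的", "识别", "方法"]
gold = {(1, 3)}                                # "条件随机场"
crf_tags = ["O", "B", "I", "I", "O", "O"]      # stand-in for CRF output

raw = decode_bio(crf_tags)         # span wrongly includes "的"
fixed = trim_rule(tokens, raw)     # rule trims the boundary error
print(prf(raw, gold), prf(fixed, gold))
```

In this toy case the stand-in labeler over-extends the term by one token, so the exact-match F-value is 0 before the rule and 1 after it. That is the kind of boundary error a rule-based pass is suited to fix, although the actual rules used in the paper may differ.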