Mongolian Morphological Segmentation Based on a Phrase-Based Statistical Machine Translation Model

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (《中文信息学报》)
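Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031; [2] Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027; [3] Datong Electric Power Senior Technical School, Datong, Shanxi 037039
Funding: National Natural Science Foundation of China (61070099); National Key Technology R&D Program of China (2009BAH41B06)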

Authors: Li Wen (李文) [1,2], Li Miao (李淼) [1], Liang Qing (梁青) [3], Zhu Hai (朱海) [1,2], Ying Yulong (应玉龙) [1,2], Wudabala (乌达巴拉) [1]

Keywords: morphology, morphological segmentation, machine translation, statistical model

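Chinese abstract (translated):

This paper combines a minimum constituent-context cost model with methods borrowed from statistical machine translation to address Mongolian morphological segmentation. A phrase-based statistical machine translation segmentation model and the minimum constituent-context cost model segment in-vocabulary words and out-of-vocabulary words, respectively. The former uses three models common in phrase-based machine translation systems: the phrase translation model, the lexicalized translation model, and the language model. The latter considers the unigram morpheme context and the N-gram affix context. Experimental results show that, with the phrase-based statistical machine translation model segmenting in-vocabulary words and the minimum constituent-context cost model handling out-of-vocabulary words, the overall segmentation precision reaches 96.94%. In addition, incorporating morphemes into a machine translation system significantly improves translation quality, which further confirms the effectiveness and practicality of the method.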

English abstract:

This paper presents a Mongolian morphological segmentation approach that combines a statistical machine translation method with a minimum constituent-context cost model. Phrase-based statistical machine translation and the minimum constituent-context cost model are adopted to segment in-vocabulary and out-of-vocabulary words, respectively. Three features commonly used in phrase-based statistical machine translation are selected for segmentation: the phrase translation probability, the lexical translation probability, and the language model score. The minimum constituent-context cost model considers the unigram morpheme context and the N-gram suffix context. Experiments show that the precision of the morphological segmentation system reaches 96.94%, and that the translation quality of the statistical machine translation system improves significantly.
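
To make the in-vocabulary step more concrete, the following is a minimal sketch of how segmentation could be cast as monotone phrase-based decoding with the three features named in the abstract (phrase translation, lexical translation, and language model scores). It is not the authors' implementation. The toy phrase table, the bigram language model, the uniform weights, the Latin-transliterated example word, and the function name `segment` are all assumptions made for illustration. The minimum constituent-context cost model for out-of-vocabulary words is not sketched because the abstract does not specify it.

```python
import math

# Hypothetical toy phrase table: surface substring -> [(morpheme, log p_phrase, log p_lex)].
# All entries and probabilities are invented for illustration only.
PHRASE_TABLE = {
    "nom": [("nom", math.log(0.9), math.log(0.8))],
    "un": [("+un", math.log(0.7), math.log(0.6))],
    "nomun": [("nomun", math.log(0.1), math.log(0.1))],
}

# Hypothetical bigram language model over morphemes (log probabilities).
BIGRAM_LM = {
    ("<s>", "nom"): math.log(0.5),
    ("nom", "+un"): math.log(0.6),
    ("<s>", "nomun"): math.log(0.05),
}
UNSEEN_LOGPROB = math.log(1e-4)  # crude floor for bigrams missing from the toy LM

# Assumed, untuned log-linear feature weights.
WEIGHTS = {"phrase": 1.0, "lex": 1.0, "lm": 1.0}


def segment(word, max_len=6):
    """Return (score, morphemes) for the best-scoring cover of `word`,
    or None when the phrase table cannot cover the word."""
    n = len(word)
    # best[i] maps the last morpheme of a hypothesis covering word[:i]
    # to (score, backpointer), where backpointer = (previous position, previous morpheme).
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for i in range(n):
        for prev, (score, _) in best[i].items():
            for j in range(i + 1, min(n, i + max_len) + 1):
                for morph, lp_phrase, lp_lex in PHRASE_TABLE.get(word[i:j], []):
                    lp_lm = BIGRAM_LM.get((prev, morph), UNSEEN_LOGPROB)
                    new_score = (score
                                 + WEIGHTS["phrase"] * lp_phrase
                                 + WEIGHTS["lex"] * lp_lex
                                 + WEIGHTS["lm"] * lp_lm)
                    if morph not in best[j] or new_score > best[j][morph][0]:
                        best[j][morph] = (new_score, (i, prev))
    if not best[n]:
        return None  # no cover found: the word would go to the fallback model
    # Pick the best final state and follow backpointers to recover the morphemes.
    morph = max(best[n], key=lambda m: best[n][m][0])
    total = best[n][morph][0]
    morphemes, pos = [], n
    while morph != "<s>":
        morphemes.append(morph)
        pos, morph = best[pos][morph][1]
    return total, morphemes[::-1]


if __name__ == "__main__":
    print(segment("nomun"))  # best cover under the toy tables
    print(segment("xyz"))    # not coverable by the toy phrase table
```

In this sketch a `None` result marks a word the phrase table cannot cover, which is the point where a system of the kind described would hand the word to its out-of-vocabulary model. In practice the feature weights would be tuned on held-out data rather than fixed at 1.0.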
