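This paper combines a minimum constituent-context cost model with methods borrowed from statistical machine translation to address Mongolian morphological segmentation. A phrase-based statistical machine translation segmentation model and the minimum constituent-context cost model segment in-vocabulary words and out-of-vocabulary words, respectively. The former uses three models common in phrase-based machine translation systems: the phrase translation model, the lexicalized translation model, and the language model. The minimum constituent-context cost model takes into account the unigram morpheme context and the suffix N-gram context. Experimental results show that, with the phrase-based statistical machine translation segmentation model handling in-vocabulary words and the minimum constituent-context cost model handling out-of-vocabulary words, the overall segmentation precision reaches 96.94%. In addition, incorporating morphemes into a machine translation system significantly improves translation quality, which further confirms the effectiveness and practicality of the method.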
This paper presents an approach to Mongolian morphological segmentation that combines statistical machine translation methods with a minimum constituent-context cost model. Phrase-based statistical machine translation and the minimum constituent-context cost model are used to segment in-vocabulary and out-of-vocabulary words, respectively. Three features commonly used in phrase-based statistical machine translation are selected for segmentation, i.e., the phrase translation probability, the lexical translation probability, and the language model score. The minimum constituent-context cost model considers the unigram morpheme context and the N-gram suffix context. Experiments show that the morphological segmentation system achieves a precision of 96.94%, and that the translation quality of a statistical machine translation system improves markedly when morphemes are incorporated.
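As a rough illustration of the two-way dispatch described above, the following minimal sketch scores candidate segmentations of an in-vocabulary word with a log-linear combination of the three features and hands out-of-vocabulary words to a separate segmenter. Everything in it is an assumption made for illustration: the function names, the uniform feature weights, the probabilities, and the placeholder Latin strings are hypothetical, and the out-of-vocabulary fallback is a trivial stand-in, not the minimum constituent-context cost model itself.

```python
import math

# Hypothetical feature weights; the abstract does not report tuned values.
WEIGHTS = {"phrase": 1.0, "lexical": 1.0, "lm": 1.0}


def loglinear_score(features, weights=WEIGHTS):
    """Weighted sum of log-probabilities for one candidate segmentation."""
    return sum(weights[name] * math.log(p) for name, p in features.items())


def best_segmentation(candidates):
    """Return the candidate (a tuple of morphemes) with the highest score.

    `candidates` maps each candidate segmentation to its feature probabilities.
    """
    return max(candidates, key=lambda seg: loglinear_score(candidates[seg]))


def segment(word, in_vocab_candidates, oov_segmenter):
    """Use the feature-based scorer for known words, the fallback otherwise."""
    if word in in_vocab_candidates:
        return best_segmentation(in_vocab_candidates[word])
    return oov_segmenter(word)


# Toy data with placeholder strings (not real Mongolian) and made-up probabilities.
TOY_CANDIDATES = {
    "stemsuf": {
        ("stem", "suf"): {"phrase": 0.6, "lexical": 0.5, "lm": 0.4},
        ("stems", "uf"): {"phrase": 0.2, "lexical": 0.1, "lm": 0.1},
    }
}


def leave_unsegmented(word):
    """Stand-in for the out-of-vocabulary model: returns the word as one unit."""
    return (word,)


known = segment("stemsuf", TOY_CANDIDATES, leave_unsegmented)
unknown = segment("unseenword", TOY_CANDIDATES, leave_unsegmented)
```

In a real system the candidate probabilities would come from trained phrase tables and a language model, and the weights would be tuned on held-out data; the sketch only shows how the three scores could be combined and where the out-of-vocabulary model would plug in.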