Phrase-based statistical machine translation is currently one of the mainstream approaches to statistical machine translation (SMT). However, existing phrase-based systems have no dedicated treatment of phrase segmentation: they assume that all phrase segmentations of a sentence are equally probable. This paper proposes a phrase segmentation method that assigns probabilities to the phrase segmentations of a sentence. First, all word strings that occur more than twice in the Chinese corpus are identified and treated as Chinese phrases. Second, sentences are segmented into phrases with the shortest-path method, and the Viterbi algorithm is applied iteratively to estimate phrase frequencies. On the test set of the 2005 HTRDP (863) Chinese-English machine translation evaluation, the BLEU4 scores obtained with the phrase segmentation model are 0.1764 (writing) and 0.2231 (dialog). The experiments show that, for long sentences (such as those in the writing set), adding the phrase segmentation model helps improve translation quality, with a gain of about 0.5 percentage points over the original system.
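The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no implementation details, so the function names (`collect_phrases`, `segment`, `train`), the occurrence cutoff `threshold`, the maximum phrase length `max_len`, and the probability floor `floor` for unseen units are all assumptions made for illustration. The sketch assumes that the shortest-path step means "fewest phrases per sentence" and that the iterative Viterbi step means re-segmenting with the current probabilities and re-counting (hard EM).

```python
import math
from collections import Counter


def collect_phrases(corpus, threshold=2, max_len=4):
    """Return multi-word strings (as tuples) occurring more than `threshold` times."""
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent)):
            for j in range(i + 2, min(i + max_len, len(sent)) + 1):
                counts[tuple(sent[i:j])] += 1
    return {p for p, c in counts.items() if c > threshold}


def segment(sent, phrases, prob=None, floor=1e-6):
    """Segment `sent` into phrases by dynamic programming.

    With prob=None every unit costs 1, so the result is the segmentation
    with the fewest units (shortest path). Otherwise each unit costs
    -log(prob), which gives the Viterbi (most probable) segmentation.
    Single words are always allowed, so a segmentation always exists.
    """
    n = len(sent)
    best = [0.0] + [math.inf] * n   # best[j]: cheapest cost of sent[:j]
    back = [0] * (n + 1)            # back[j]: start index of the last unit
    for j in range(1, n + 1):
        for i in range(j):
            unit = tuple(sent[i:j])
            if j - i > 1 and unit not in phrases:
                continue
            cost = 1.0 if prob is None else -math.log(prob.get(unit, floor))
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    units, j = [], n
    while j > 0:                    # follow back-pointers from the end
        units.append(tuple(sent[back[j]:j]))
        j = back[j]
    return units[::-1]


def train(corpus, phrases, iterations=3):
    """Iteratively re-segment the corpus and re-estimate unit probabilities."""
    prob = None                     # first pass uses shortest-path segmentation
    for _ in range(iterations):
        counts = Counter()
        for sent in corpus:
            counts.update(segment(sent, phrases, prob))
        total = sum(counts.values())
        prob = {unit: c / total for unit, c in counts.items()}
    return prob


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "e"]]
    toy_phrases = collect_phrases(toy)
    toy_prob = train(toy, toy_phrases)
    print(segment(["a", "b", "c"], toy_phrases, toy_prob))
```

In a real system the sentences would be word-segmented Chinese text, and the resulting segmentation probabilities would be combined with the other models in the decoder; how the paper does this is not stated in the abstract.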