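This paper studies how to build statistical machine translation systems when no direct parallel training data from a foreign language to Chinese is available, using English as the pivot language. Pivot-language-based machine translation methods are divided into three types: system-level, corpus-level, and phrase-level. The improved corpus-level pivot method proposed here raises translation performance by enlarging the generated training data and optimizing word alignment quality. In the conventional phrase-level pivot method, pivot phrases that cannot be merged prevent many high-quality phrase pairs from being generated; the improved method proposed here enlarges the phrase translation table through decoding-based generation and thereby improves translation quality. The paper systematically compares the strengths and weaknesses of the three pivot methods, and manual analysis shows that no single method achieves the best translation performance on all translation tasks. A combined corpus-level and phrase-level pivot method is therefore proposed, and it achieves the best translation performance on all translation tasks. Finally, machine translation systems from Bengali, Tamil, Uzbek, and Hungarian into Chinese were successfully built. Compared with the baseline system, the proposed method gains 0.8 to 2.8 BLEU points on the test sets of the four foreign languages.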
In this paper, we use English as the pivot language to build statistical machine translation systems, because no direct parallel training corpora exist between the foreign languages and Chinese. We classify pivot-language-based methods into system-level, corpus-level, and phrase-level methods. In our improved corpus-level method, we raise translation performance by enlarging the bilingual training corpora and improving the quality of the word alignments. In the typical phrase-level pivot method, many high-quality phrase pairs cannot be generated from the source-pivot and pivot-target phrase translation tables, so we use a decoding-generation method to enlarge the phrase translation table and improve translation performance. We analyze the strengths and weaknesses of the system-level, corpus-level, and phrase-level pivot approaches during system construction, and human analysis shows that no single method achieves the best translation performance on all translation tasks. We therefore propose a corpus-phrase combination pivot method, which achieves the highest BLEU scores on all translation tasks. We translate Bengali, Tamil, Uzbek, and Hungarian into Chinese with our proposed pivot-language-based methods. Finally, we observe significant improvements of 0.8 to 2.8 BLEU points on the test datasets for Bengali, Tamil, Uzbek, and Hungarian compared with the baseline translation system.
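To make the limitation of the typical phrase-level pivot method concrete, the following is a minimal sketch of how a source-pivot and a pivot-target phrase table might be merged. It assumes the commonly used marginalization over shared pivot phrases, p(t|s) = Σ_p p(t|p)·p(p|s). The abstract does not state which formula the paper uses, so this is an assumption. The function name `triangulate`, the table layout, and all phrases and probabilities are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Merge a source->pivot and a pivot->target phrase table.

    Each table maps a phrase to {translation: probability}.
    Assumed scoring (not taken from the paper):
        p(tgt | src) = sum over pivot of p(tgt | pivot) * p(pivot | src)
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # A pivot phrase missing from the second table cannot be bridged,
            # so every phrase pair that depends on it is lost.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Hypothetical toy tables; the English strings stand in for pivot phrases.
src_pivot = {
    "src_a": {"thank you": 0.7, "thanks": 0.3},
    "src_b": {"many thanks indeed": 1.0},  # pivot phrase absent below
}
pivot_tgt = {
    "thank you": {"tgt_1": 0.9, "tgt_2": 0.1},
    "thanks": {"tgt_1": 1.0},
}

table = triangulate(src_pivot, pivot_tgt)
for src, tgts in table.items():
    print(src, tgts)
```

In this toy example `src_b` receives no target entry at all, because its only pivot phrase never appears on the pivot side of the second table. This is the kind of gap that the decoding-generation step described above is meant to fill.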