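This paper aims to select a high-quality subset of an existing large-scale parallel corpus as training data for a statistical machine translation (SMT) system when the text to be translated is not known in advance, thereby reducing training and decoding costs. Combining two factors, coverage and sentence-pair translation quality, we propose a method for extracting a small, high-quality training subset from an existing parallel corpus. Experiments on the CWMT2008 Chinese-English translation task show that the method can select a high-quality subset from an existing large-scale corpus: with the training corpus reduced by 80%, it achieves translation performance (BLEU) comparable to that of the baseline system, which uses the full training corpus.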
In statistical machine translation (SMT), effective selection of training data can reduce the cost of system training and decoding. To address this issue, we propose a framework that selects a small portion of the whole training data set for SMT by considering both coverage and sentence-pair quality. Experimental results on the CWMT2008 Chinese-to-English MT task show that the framework effectively selects a subset from a large training data set. Even when trained on only the 20% of the data selected by our framework, the SMT system achieves performance comparable to that of the baseline system trained on all the data.
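To make the idea of combining coverage with sentence-pair quality concrete, the following is a minimal sketch and not the method of this paper, whose details the abstract does not give. It assumes a greedy strategy in which each candidate is scored by its quality value multiplied by the number of source-side n-grams it would newly cover. The names `ngrams` and `select_subset`, the use of unigrams and bigrams, the multiplicative score, and the precomputed `quality` values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], max_n: int = 2) -> Set[Tuple[str, ...]]:
    """Return all n-grams of length 1..max_n in a token sequence."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_subset(
    sources: List[List[str]],
    quality: List[float],
    budget: int,
) -> List[int]:
    """Greedily pick up to `budget` sentence-pair indices.

    Assumed scoring (illustrative only): quality[i] times the number of
    source n-grams of pair i not yet covered by the selected pairs.
    """
    covered: Set[Tuple[str, ...]] = set()
    remaining = set(range(len(sources)))
    chosen: List[int] = []
    while remaining and len(chosen) < budget:
        # Iterate in index order so ties go to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda i: quality[i] * len(ngrams(sources[i]) - covered),
        )
        chosen.append(best)
        covered |= ngrams(sources[best])
        remaining.remove(best)
    return chosen
```

Under these assumptions, a pair that duplicates already selected material contributes no new n-grams and is ranked last regardless of its quality, while a low-quality pair is penalised even if it covers new material. Each round rescans all remaining pairs, so a large corpus would need a more efficient selection loop.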