Parallel corpora are an essential foundation for building high-quality statistical machine translation (SMT) systems. Unlike the conventional strategy of improving translation quality by enlarging the parallel corpus, this paper aims to improve SMT performance by exploiting the full potential of existing resources. We propose a method for selecting and optimizing SMT training data based on an information retrieval model. First, sentences in the existing training data that are similar to the text to be translated are selected to form a small, adapted training subset. With this subset alone, and without requiring additional computing resources, the system achieves translation results comparable to or even better than those obtained with the full data. Second, the selected subset is added to the original training data to optimize the data distribution, which further improves translation quality. Experiments show that the method makes effective use of existing data resources to improve SMT performance.
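The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the retrieval model, so the code below assumes TF-IDF weighting with cosine similarity, whitespace tokenization, and a fixed number of retrieved sentences per test sentence. The function names (`select_training_subset`, `tfidf_vectors`, `cosine`), the `top_k` parameter, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs, idf):
    """Turn tokenized sentences into sparse TF-IDF dictionaries."""
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        # Terms unseen in the training data get weight 0 (assumption).
        vectors.append({t: c * idf.get(t, 0.0) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_training_subset(parallel_corpus, test_sentences, top_k=2):
    """Return sorted indices of training pairs whose source side is
    most similar to the test text (step 1 of the sketch)."""
    train_src = [src.lower().split() for src, _ in parallel_corpus]
    test_src = [s.lower().split() for s in test_sentences]

    # Smoothed inverse document frequency over the training source side.
    n = len(train_src)
    df = Counter(t for tokens in train_src for t in set(tokens))
    idf = {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}

    train_vecs = tfidf_vectors(train_src, idf)
    test_vecs = tfidf_vectors(test_src, idf)

    selected = set()
    for query in test_vecs:
        scored = sorted(
            ((cosine(query, doc), i) for i, doc in enumerate(train_vecs)),
            key=lambda pair: (-pair[0], pair[1]),
        )
        # Keep the top_k hits, ignoring sentences with no overlap at all.
        selected.update(i for score, i in scored[:top_k] if score > 0.0)
    return sorted(selected)


# Toy parallel corpus (source, target); purely illustrative data.
corpus = [
    ("the bank raised interest rates", "银行 提高 了 利率"),
    ("the river bank was flooded", "河岸 被 淹没 了"),
    ("interest rates fell sharply", "利率 大幅 下降"),
    ("he plays football on sunday", "他 周日 踢 足球"),
]
test = ["interest rates may rise again"]

# Step 1: build the small, adapted training subset.
subset_idx = select_training_subset(corpus, test, top_k=2)
subset = [corpus[i] for i in subset_idx]

# Step 2: add the subset to the full corpus, which upweights
# test-like sentence pairs in the training distribution.
augmented = corpus + subset
```

In this sketch, `subset` would be used to train a compact adapted model, while `augmented` would be used to train on the full data with the selected pairs counted twice. How many sentences to retrieve per test sentence is a tuning choice that the abstract leaves open.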