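A large-scale, high-quality bilingual parallel corpus is an important foundation for building a high-quality statistical machine translation system, but noise in the corpus degrades the system's performance, so the sentence pairs in a large-scale corpus need to be filtered. Unlike traditional ranking-based corpus selection models, this paper proposes a classification-based method for parallel corpus selection. A small number of sentence-pair features are used to construct classifier training pairs that differ widely in quality; the classifier is then trained on these pairs with a larger set of sentence-pair features and applied to the remaining unclassified pairs. Compared with the baseline system, our method not only reduces the size of the training corpus by 40% but also raises the BLEU score on the NIST test sets by 0.87 points.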
A large-scale bilingual corpus is a fundamental resource for building a high-quality statistical machine translation system. However, such corpora usually contain a large amount of noise, which degrades translation performance, so it is essential to filter out noisy sentence pairs. In this paper, we propose a classification-based selection approach that distinguishes high-quality bilingual sentence pairs from noisy ones. We first use a small number of metrics to find the best and worst sentence pairs in the corpus. We then train a classifier on these pairs with a larger feature set and use it to classify the remaining sentence pairs. Experimental results show that our approach not only eliminates the 40% least promising sentence pairs but also significantly improves translation performance, by 0.87 BLEU points over using all sentence pairs.
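The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the features, the classifier, or the seed size, so the lexicon-coverage and length-ratio features, the logistic-regression classifier, the `seed_frac` parameter, and all function names here are assumptions chosen for the example.

```python
import math

def coverage(src_words, tgt_words, lexicon):
    """Fraction of source words with at least one lexicon translation in the target."""
    if not src_words:
        return 0.0
    tgt = set(tgt_words)
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt)
    return hits / len(src_words)

def reverse_coverage(src_words, tgt_words, lexicon):
    """Fraction of target words that translate some source word."""
    if not tgt_words:
        return 0.0
    translations = set()
    for w in src_words:
        translations |= lexicon.get(w, set())
    return sum(1 for t in tgt_words if t in translations) / len(tgt_words)

def seed_score(pair, lexicon):
    """Stage 1 (assumed): one cheap metric used to find the best and worst pairs."""
    src, tgt = pair
    return coverage(src.split(), tgt.split(), lexicon)

def features(pair, lexicon):
    """Stage 2 (assumed): a richer feature vector, with a leading bias term."""
    src, tgt = pair
    s, t = src.split(), tgt.split()
    longest = max(len(s), len(t))
    ratio = min(len(s), len(t)) / longest if longest else 0.0
    return [1.0, ratio, coverage(s, t, lexicon), reverse_coverage(s, t, lexicon)]

def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient ascent on the logistic log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (label - p) * xj
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

def select_pairs(corpus, lexicon, seed_frac=0.25):
    """Keep the seed positives plus every remaining pair the classifier accepts."""
    ranked = sorted(corpus, key=lambda p: seed_score(p, lexicon), reverse=True)
    k = max(1, int(len(ranked) * seed_frac))
    best, worst, rest = ranked[:k], ranked[-k:], ranked[k:-k]
    X = [features(p, lexicon) for p in best + worst]
    y = [1] * k + [0] * k
    w = train_logistic(X, y)
    kept = list(best)
    for p in rest:
        z = sum(wi * xi for wi, xi in zip(w, features(p, lexicon)))
        if z > 0:  # predicted probability above 0.5
            kept.append(p)
    return kept
```

Calling `select_pairs(corpus, lexicon)` with a list of `(source, target)` string pairs and a dictionary mapping each source word to a set of target words returns the retained pairs. In practice the seed metrics and the classifier features would be replaced by whatever the full paper specifies.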