Redundancy in a training corpus arises at both the word level and the sentence level. Existing corpus selection methods rely mainly on word-level coverage. They effectively reduce redundancy among words and phrases, but they do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus, while obtaining a translation system that is as good as or better than one trained on the full data, we select parallel data based on the coverage of bilingual sentence pairs and propose a selection method that combines unseen n-grams with edit distance. Experimental results show that the proposed method uses less training data yet achieves translation performance almost equivalent to that of the system trained on the original corpus.
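The idea of combining unseen n-grams with edit distance can be illustrated with a minimal Python sketch. The abstract does not specify how the two criteria are combined, so everything below is an assumption made for illustration: the function names (`ngrams`, `edit_distance`, `select_pairs`), the thresholds `min_new` and `min_dist`, the greedy single pass, and the choice to measure both criteria on whitespace-tokenized source sentences only. A pair is kept only if its source sentence contributes enough n-grams not yet covered by the selected data (word-level coverage) and is not a near-duplicate of an already selected sentence (sentence-level coverage).

```python
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def ngrams(tokens: Sequence[str], max_n: int = 3) -> Set[Tuple[str, ...]]:
    """Return all n-grams of order 1..max_n found in a token sequence."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def select_pairs(pairs: List[Pair], max_n: int = 3,
                 min_new: int = 1, min_dist: float = 0.3) -> List[Pair]:
    """Greedy selection (hypothetical rule, not the paper's exact one).

    A pair is kept when its source side adds at least `min_new` unseen
    n-grams AND its length-normalized edit distance to every previously
    kept source sentence is at least `min_dist`.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept_tokens: List[List[str]] = []
    selected: List[Pair] = []
    for src, tgt in pairs:
        tokens = src.split()
        grams = ngrams(tokens, max_n)
        if len(grams - seen) < min_new:
            continue  # nothing new at the word/phrase level
        too_close = any(
            edit_distance(tokens, other) / max(len(tokens), len(other), 1) < min_dist
            for other in kept_tokens
        )
        if too_close:
            continue  # near-duplicate at the sentence level
        seen |= grams
        kept_tokens.append(tokens)
        selected.append((src, tgt))
    return selected
```

In this sketch, an exact repeat of a kept sentence is rejected by the unseen n-gram test, while a sentence that differs from a kept one by a single token passes that test but is rejected by the edit distance test. The pairwise comparison against every kept sentence is quadratic in the worst case, so a large-scale corpus would need an indexing or blocking step that is not shown here.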