A New Method for Phrase Segmentation in Statistical Machine Translation

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (中文信息学报)
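Pages: 85-89
Language: Chinese
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; [2] Graduate School of the Chinese Academy of Sciences, Beijing 100039
Funding: National 863 Program (2005AA114140); National Natural Science Foundation of China (60573188)
Related project: Research on Statistical Machine Translation Methods Based on Phrase-Structure Transformation Templates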

Keywords: artificial intelligence, machine translation, statistical machine translation, translation model, phrase segmentation

Chinese abstract (translated):

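Phrase-based statistical machine translation is currently a mainstream approach to statistical machine translation, yet existing phrase-based systems give phrase segmentation no special treatment and assume that all phrase segmentations of a sentence are equally probable. This paper proposes a phrase segmentation method that assigns probabilities to the phrase segmentations of a sentence. First, all word strings that occur more than twice in the Chinese corpus are identified and treated as Chinese phrases. Second, sentences are segmented into phrases with a shortest-path method, and the Viterbi algorithm is applied iteratively to estimate phrase frequencies. On the test set of the 2005 863 Chinese-English machine translation evaluation, the BLEU4 scores are 0.1764 (written text) and 0.2231 (dialogue). The experiments show that for long sentences (such as written text), adding the phrase segmentation model helps improve translation quality, by about 0.5 percentage points over the previous system.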
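The first two steps described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `extract_phrases` and `shortest_path_segment`, the maximum phrase length `max_len`, and the choice to read "shortest path" as "fewest phrases" are all assumptions; the abstract only gives the frequency threshold and names the method.

```python
from collections import Counter

def extract_phrases(corpus, threshold=2, max_len=4):
    """Return word strings that occur more than `threshold` times.

    `corpus` is a list of sentences, each a list of words (hypothetical format).
    `max_len` caps the phrase length; the abstract does not state a cap.
    """
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                counts[tuple(sent[i:j])] += 1
    return {phrase for phrase, c in counts.items() if c > threshold}

def shortest_path_segment(sent, phrases, max_len=4):
    """Segment `sent` into the fewest phrases (assumed meaning of "shortest path").

    Single words are always allowed so that every sentence has a segmentation.
    """
    n = len(sent)
    best = [0] + [float("inf")] * n   # best[j]: fewest phrases covering sent[:j]
    back = [0] * (n + 1)              # back[j]: start index of the last phrase
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            span = tuple(sent[i:j])
            if (j - i == 1 or span in phrases) and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i
    # Follow the back-pointers from the end of the sentence.
    segmentation = []
    j = n
    while j > 0:
        i = back[j]
        segmentation.append(tuple(sent[i:j]))
        j = i
    return segmentation[::-1]

# Toy usage with placeholder tokens (not data from the paper).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "e"]]
phrases = extract_phrases(corpus)
print(shortest_path_segment(["a", "b", "c"], phrases))
```

In the toy corpus, only "a", "b", and "a b" occur more than twice, so the sentence "a b c" is split into the phrase "a b" followed by the single word "c".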

English abstract:

Phrase-based statistical machine translation is currently the state-of-the-art method in the SMT community. However, no existing phrase-based system has a dedicated module for phrase segmentation; they all treat every segmentation of a sentence as equally probable. In this paper, we propose a phrase segmentation method. First, we find the word strings that occur more than once in the Chinese corpus and treat them as Chinese phrases. Second, we segment sentences into phrases with the shortest-path method and train iteratively with the Viterbi algorithm to estimate phrase probabilities. We run experiments on the 2005 HTRDP (863) MT evaluation test set. With the phrase segmentation model, the BLEU4 scores are 0.1764 (writing) and 0.2231 (dialog). The experiments show that the phrase segmentation model helps improve translation quality on long sentences: we obtain an increase of about 0.5 percentage points on the writing set.
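The iterative Viterbi training mentioned in the abstract can be sketched as a hard-EM loop: segment every sentence with the current phrase probabilities, count the phrases on the best paths, and renormalize. The abstract does not give the update rule, so the sketch below is an assumption throughout; `viterbi_segment`, `viterbi_train`, `max_len`, `floor`, and the iteration count are hypothetical names and values chosen for illustration.

```python
import math
from collections import Counter

def viterbi_segment(sent, prob, max_len=4, floor=1e-6):
    """Return the most probable segmentation of `sent` under `prob`.

    Path cost is the sum of -log p(phrase), so the best segmentation is a
    shortest path. Unknown single words get the small probability `floor`
    (an assumption) so that every sentence can be segmented.
    """
    n = len(sent)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            span = tuple(sent[i:j])
            if span in prob:
                p = prob[span]
            elif j - i == 1:
                p = floor
            else:
                continue
            cost = best[i] - math.log(p)
            if cost < best[j]:
                best[j] = cost
                back[j] = i
    segmentation = []
    j = n
    while j > 0:
        i = back[j]
        segmentation.append(tuple(sent[i:j]))
        j = i
    return segmentation[::-1]

def viterbi_train(corpus, phrases, iterations=5, max_len=4):
    """Re-estimate phrase probabilities from best-path counts (hard EM)."""
    prob = {p: 1.0 / len(phrases) for p in phrases}  # uniform start
    for _ in range(iterations):
        counts = Counter()
        for sent in corpus:
            counts.update(viterbi_segment(sent, prob, max_len))
        total = sum(counts.values())
        prob = {p: c / total for p, c in counts.items()}
    return prob

# Toy usage with placeholder tokens (not data from the paper).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "e"]]
phrases = {("a",), ("b",), ("a", "b")}
prob = viterbi_train(corpus, phrases)
print(prob[("a", "b")])
```

Because only phrases that appear on a best path keep any probability mass, candidates that are never chosen drop out after the first iteration; a real system would probably smooth these estimates rather than discard them.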
