Acquiring English-Khmer Bilingual Parallel Sentence Pairs Based on a Maximum Entropy Model

ISSN: 0253-2395
Journal: Journal of Shanxi University (Natural Science Edition) (《山西大学学报：自然科学版》)
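Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; [2] Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500
Funding: National Natural Science Foundation of China (Nos. 61462055, 61472168); Key Project of the Natural Science Foundation of Yunnan Province (2013FA130)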

Authors: Yan Xin (严馨)[1,2], Wang Ruolan (王若兰)[1,2], Yu Zhengtao (余正涛)[1,2], Pan Litong (潘丽同), Guo Jianyi (郭剑毅)[1,2]

Keywords: English-Khmer bilingual parallel corpora, maximum entropy classifier, parallel sentence pairs

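Abstract (translated from the Chinese):

An English-Khmer bilingual parallel corpus is a basic resource for Khmer information processing and is of great research value for advancing Khmer language processing technology. Once parallel bilingual web pages have been obtained, the task of acquiring parallel sentence pairs is treated as a classification problem over candidate parallel sentence pairs. To identify the parallel pairs among the candidates, a binary maximum entropy classifier is constructed. The English-Khmer sentence pair classifier is trained with sentence length features, lexicalization ratio features, sentence position features, symbol features, and others. The classifier is then applied to the candidate English-Khmer sentence pairs to determine the English-Khmer parallel sentence pair resources. Experiments that compare the results as different features are added show that the final precision and recall both exceed 90%, which indicates that the classifier identifies parallel sentence pairs well.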

English abstract:

English-Khmer bilingual parallel corpora are a basic resource for Khmer information processing and are very important for promoting its development. After parallel bilingual web pages have been obtained, the problem of acquiring parallel sentence pairs is regarded as a classification of candidate parallel sentence pairs. We construct a maximum entropy classifier to identify the parallel sentence pairs among the candidates. We train the English-Khmer sentence pair classifier with features based on sentence length, lexicalization ratio, sentence position, and symbols. Finally, we use this classifier to classify the candidate English-Khmer sentence pairs and thereby determine the English-Khmer parallel sentence pair resources. Experiments comparing different feature sets show that the classifier reaches a precision and a recall of more than 90 percent, which suggests that it performs well at identifying parallel sentences.
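
The classification step described in the abstract can be illustrated with a minimal sketch. For two classes, a maximum entropy classifier is equivalent to logistic regression, so the sketch trains one by gradient ascent. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the exact feature formulas, the `lexicon` dictionary, the toy training vectors, and the learning settings.

```python
import math

SYMBOLS = "?!%:;()"

def _symbols(text):
    # Collect punctuation and digits; int(c) maps Khmer digits to 0-9 too.
    return {str(int(c)) if c.isdecimal() else c
            for c in text if c.isdecimal() or c in SYMBOLS}

def extract_features(en, km, en_pos, km_pos, lexicon):
    """Turn one candidate (English, Khmer) pair into a feature vector.

    en_pos / km_pos: relative position (0..1) of each sentence in its page.
    lexicon: hypothetical English -> Khmer word dictionary.
    """
    # Length feature: shorter / longer sentence length, in characters.
    len_ratio = min(len(en), len(km)) / max(len(en), len(km), 1)
    # Lexical feature: share of English words whose dictionary translation
    # appears in the Khmer sentence (substring match, no word segmentation).
    words = en.lower().split()
    hits = sum(1 for w in words if w in lexicon and lexicon[w] in km)
    lex_ratio = hits / max(len(words), 1)
    # Position feature: how close the two relative positions are.
    pos_sim = 1.0 - abs(en_pos - km_pos)
    # Symbol feature: Jaccard overlap of digits and punctuation.
    s_en, s_km = _symbols(en), _symbols(km)
    union = s_en | s_km
    sym_sim = len(s_en & s_km) / len(union) if union else 1.0
    return [1.0, len_ratio, lex_ratio, pos_sim, sym_sim]  # 1.0 = bias term

def predict_proba(w, x):
    # P(parallel | x) under the binary maximum entropy (logistic) model.
    z = sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=500, lr=0.5):
    # Stochastic gradient ascent on the conditional log-likelihood.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = predict_proba(w, x)
            for j, xj in enumerate(x):
                w[j] += lr * (t - p) * xj
    return w

def precision_recall(pred, gold):
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy feature vectors (invented for illustration): 1 = parallel, 0 = not.
X = [[1, 0.90, 0.8, 0.95, 1.0], [1, 0.85, 0.7, 0.90, 0.5],
     [1, 0.95, 0.9, 1.00, 1.0], [1, 0.40, 0.1, 0.30, 0.0],
     [1, 0.60, 0.0, 0.50, 0.0], [1, 0.30, 0.2, 0.20, 0.5]]
y = [1, 1, 1, 0, 0, 0]
weights = train(X, y)
predictions = [1 if predict_proba(weights, x) > 0.5 else 0 for x in X]
print(precision_recall(predictions, y))
```

In practice the candidate pairs would come from aligned bilingual web pages and the scores would be measured on held-out data rather than on the training vectors used here.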