English-Khmer bilingual parallel corpora are a basic resource for Khmer information processing and are important for advancing Khmer language processing technology. After parallel bilingual web pages have been obtained, the task of extracting parallel sentence pairs is treated as a classification problem over candidate sentence pairs. To identify the parallel pairs among the candidates, we construct a binary maximum entropy classifier. The classifier is trained on English-Khmer sentence pairs using sentence length, lexical ratio, sentence position, and symbol features. Finally, the trained classifier is applied to the candidate English-Khmer sentence pairs to determine the English-Khmer parallel sentence pair resource. Experiments comparing different feature combinations show that the final precision and recall both exceed 90%, indicating that the classifier identifies parallel sentence pairs effectively.
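The classification step described in the abstract can be illustrated with a minimal sketch. With two classes, a maximum entropy model has the same form as logistic regression, so the sketch trains one by stochastic gradient ascent on the log-likelihood. Everything concrete here is an assumption for illustration, not the authors' implementation: the toy `LEXICON`, the `SYMBOLS` set, the exact feature definitions, the relative sentence positions in [0, 1], the pre-segmented Khmer tokens, and the toy training vectors.

```python
import math

# Hypothetical toy bilingual lexicon (English word -> set of Khmer translations).
# A real system would load a full English-Khmer dictionary instead.
LEXICON = {"hello": {"សួស្តី"}, "world": {"ពិភពលោក"}}

# Assumed set of language-independent symbols (digits and some punctuation).
SYMBOLS = set("0123456789?!%")


def features(en_tokens, km_tokens, en_pos, km_pos, lexicon=LEXICON):
    """Map a candidate pair to [bias, length, lexical, position, symbol] features.

    Tokens are assumed to be pre-segmented; en_pos and km_pos are the relative
    positions (0.0 to 1.0) of the sentences in their source pages.
    """
    longest = max(len(en_tokens), len(km_tokens), 1)
    len_ratio = min(len(en_tokens), len(km_tokens)) / longest

    km_set = set(km_tokens)
    translated = sum(1 for w in en_tokens if lexicon.get(w.lower(), set()) & km_set)
    lex_ratio = translated / max(len(en_tokens), 1)

    pos_sim = 1.0 - abs(en_pos - km_pos)

    en_sym = {c for t in en_tokens for c in t if c in SYMBOLS}
    km_sym = {c for t in km_tokens for c in t if c in SYMBOLS}
    union = en_sym | km_sym
    sym_sim = len(en_sym & km_sym) / len(union) if union else 1.0

    return [1.0, len_ratio, lex_ratio, pos_sim, sym_sim]


def predict_proba(weights, x):
    """Probability that the pair described by feature vector x is parallel."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, lr=0.5):
    """Fit the binary maximum entropy (logistic) model by stochastic gradient ascent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict_proba(weights, x)
            for j, xj in enumerate(x):
                weights[j] += lr * error * xj
    return weights


# Toy, hand-made feature vectors (assumed, not data from the paper).
TRAIN_X = [
    [1.0, 0.9, 0.8, 0.95, 1.0],  # parallel
    [1.0, 1.0, 0.7, 0.90, 1.0],  # parallel
    [1.0, 0.3, 0.0, 0.20, 0.0],  # not parallel
    [1.0, 0.5, 0.1, 0.40, 0.5],  # not parallel
]
TRAIN_Y = [1, 1, 0, 0]

weights = train(TRAIN_X, TRAIN_Y)
candidate = features(["Hello", "world", "!"], ["សួស្តី", "ពិភពលោក", "!"], 0.1, 0.1)
is_parallel = predict_proba(weights, candidate) > 0.5
```

A candidate pair is accepted as parallel when its predicted probability exceeds a threshold (0.5 here, also an assumption). Raising the threshold would trade recall for precision.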