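Named entity translation equivalent pairs are highly valuable in cross-language information processing. However, because corpus resources are limited, methods for extracting Chinese-Khmer named entity equivalent pairs have not yet been studied in depth, either in China or abroad. Starting from comparable corpus text, and based on the characteristics of different types of entity elements and their behavior in comparable corpora, this paper selects the transliteration and translation features from Khmer named entities to Chinese named entities, the context feature of named entities in the comparable corpus, and the length feature of the entities themselves, and proposes a method that computes similarity through multi-feature fusion to mine Chinese-Khmer bilingual named entity equivalent pairs. Experiments show that the method achieves fairly good results: for person-name entity pairs, precision reaches 76% and recall reaches 66%, demonstrating that the method outperforms methods that use only a single feature.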
Named entity translation equivalents play a significant role in cross-language information processing. However, because corpus resources are limited, few in-depth studies have addressed the extraction of bilingual Chinese-Khmer named entity equivalents. Starting from comparable corpus text, and according to the characteristics of each entity type and of the comparable corpus, the paper selects four features: the transliteration feature, the translation feature, the context feature of the bilingual Chinese-Khmer named entity equivalents, and the length feature. On this basis, a method based on multi-feature fusion is proposed to calculate similarity and mine bilingual Chinese-Khmer named entity equivalents. Experiments show that the method performs well when the equivalents are acquired through the computation of feature similarity, and that it outperforms methods that use only a single feature.
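As an illustration of what multi-feature fusion might look like, the following minimal Python sketch combines four per-feature similarity scores into one value and picks the best Chinese candidate for a Khmer entity. The abstract does not specify the fusion formula, so the weighted linear combination, the length-ratio similarity, the weights, the threshold, and all names (`FeatureScores`, `length_similarity`, `fuse`, `best_match`) are assumptions made for illustration and not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FeatureScores:
    """Per-feature similarity scores in [0, 1] for one candidate pair (hypothetical)."""
    transliteration: float
    translation: float
    context: float
    length: float


def length_similarity(src_len: int, tgt_len: int) -> float:
    """Assumed length feature: ratio of the shorter length to the longer one."""
    if src_len <= 0 or tgt_len <= 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def fuse(scores: FeatureScores, weights: Dict[str, float]) -> float:
    """Assumed fusion rule: weighted average of the feature scores."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * getattr(scores, name) for name, w in weights.items())
    return weighted / total


def best_match(
    candidates: Dict[str, FeatureScores],
    weights: Dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Return the candidate with the highest fused score, or None if below threshold."""
    best_name, best_score = None, threshold
    for name, scores in candidates.items():
        score = fuse(scores, weights)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Illustrative weights only; in practice they would be tuned on held-out data.
EQUAL_WEIGHTS = {
    "transliteration": 0.25,
    "translation": 0.25,
    "context": 0.25,
    "length": 0.25,
}
```

With such a scheme, a candidate that scores well on only one feature (for example, a close transliteration with an unrelated context) is ranked below one that scores moderately on all four, which is the intuition behind preferring fusion over a single feature.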