This paper addresses Khmer word segmentation and part-of-speech (POS) tagging with an automatic method based on a cascaded conditional random fields (CCRFs) model. The method stacks three conditional random field (CRF) models. The first layer is a word segmentation model that operates at the granularity of Khmer character clusters (KCCs); its feature template combines context information with the word-formation characteristics of Khmer, and it segments Khmer sentences automatically. The second layer is a segmentation correction model that operates at word granularity; its feature template combines context information with the structural characteristics of Khmer named entities, and it corrects the output of the first layer. The third layer is a POS tagging model that also operates at word granularity; its feature template combines context information with Khmer's rich affix information, and it assigns a POS tag to each word in the sentence. In an open test, the model achieved a final accuracy of 95.44%, indicating that the method can effectively address Khmer word segmentation and POS tagging.
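The data flow through the three layers can be illustrated with a minimal sketch. It is not the authors' implementation. The B/M/E/S tagging scheme, the window width, the feature names, and the functions `window_features`, `tags_to_words`, and `run_cascade` are all assumptions made for illustration, and the three layers are passed in as plain callables standing in for trained CRF models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

PAD = "<PAD>"  # assumed padding symbol for positions outside the sentence


def window_features(units: Sequence[str], i: int, width: int = 2) -> Dict[str, str]:
    """Context-window features for position i (hypothetical template).

    Emits the unigrams at offsets -width..width plus the two bigrams
    adjacent to the current unit. The units may be character clusters
    (layer 1) or words (layers 2 and 3).
    """
    def at(j: int) -> str:
        return units[j] if 0 <= j < len(units) else PAD

    feats = {f"U[{off}]": at(i + off) for off in range(-width, width + 1)}
    feats["B[-1,0]"] = at(i - 1) + "|" + at(i)
    feats["B[0,1]"] = at(i) + "|" + at(i + 1)
    return feats


def tags_to_words(units: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Join units into words from B/M/E/S tags (assumed tag scheme)."""
    words: List[str] = []
    buf: List[str] = []
    for unit, tag in zip(units, tags):
        buf.append(unit)
        if tag in ("E", "S"):  # a word ends on E (end) or S (single)
            words.append("".join(buf))
            buf = []
    if buf:  # flush an unterminated word at the end of the sentence
        words.append("".join(buf))
    return words


def run_cascade(
    units: Sequence[str],
    segment: Callable[[Sequence[str]], Sequence[str]],  # layer 1: units -> B/M/E/S tags
    correct: Callable[[List[str]], List[str]],          # layer 2: words -> corrected words
    tag: Callable[[List[str]], Sequence[str]],          # layer 3: words -> POS tags
) -> List[Tuple[str, str]]:
    """Feed each layer's output into the next and return (word, POS) pairs."""
    words = tags_to_words(units, segment(units))
    words = correct(words)
    return list(zip(words, tag(words)))


if __name__ == "__main__":
    # Placeholder units and stand-in layers; a real system would plug in trained models.
    units = ["k1", "k2", "k3", "k4"]
    result = run_cascade(
        units,
        segment=lambda u: ["B", "E", "S", "S"],
        correct=lambda w: [w[0], w[1] + w[2]],  # e.g. merge a split named entity
        tag=lambda w: ["T1", "T2"],             # placeholder POS labels
    )
    print(result)
```

Keeping the layers as interchangeable callables makes the cascade itself independent of any particular CRF toolkit. The language-specific features the abstract mentions (word-formation cues, named-entity structure, affixes) would be added to the feature dictionary for the corresponding layer.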