To investigate automatic word segmentation and part-of-speech (POS) tagging for Mongolian, this paper first compiles statistics on and analyzes the segmentation and POS annotation of a 200,000-word Mongolian corpus, and corrects the segmentation and tagging errors found in it in a second pass. A Conditional Random Fields (CRF) model is then applied to three tasks: automatic word segmentation, POS tagging, and a unified process that performs word segmentation and POS tagging together. Open-test results show that the precision of automatic word segmentation exceeds 98%; that the precision of the "unified process" is 3.55% higher than that of the "two-step" approach (word segmentation first, then POS tagging); and that the "unified process" reaches a precision of 93.38% when the context and the "agglutinative word-formation suffix" feature are taken into account. To some extent, this solves the problems of Mongolian word segmentation and POS tagging.
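The difference between the "two-step" and "unified" settings can be illustrated by how training labels might be encoded for a sequence labeler such as a CRF. The abstract does not specify the label scheme, so the following is only a minimal sketch under stated assumptions: a B/I boundary scheme over individual characters, joint labels of the form `B-POS` / `I-POS`, and the hypothetical helper names `segmentation_labels`, `joint_labels`, and `decode_joint`. The strings `abc` and `de` and the tags `N` and `SUF` are placeholders, not real Mongolian data or the paper's tag set.

```python
from typing import List, Tuple

def segmentation_labels(segments: List[str]) -> List[str]:
    """Two-step, stage 1: boundary labels only (B = begins a segment, I = inside)."""
    labels: List[str] = []
    for seg in segments:
        if not seg:
            continue  # skip empty segments so every label maps to one character
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels

def joint_labels(tagged: List[Tuple[str, str]]) -> List[str]:
    """Unified setting: each label carries both the boundary and the POS tag."""
    labels: List[str] = []
    for seg, pos in tagged:
        if not seg:
            continue
        labels.extend([f"B-{pos}"] + [f"I-{pos}"] * (len(seg) - 1))
    return labels

def decode_joint(units: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Recover (segment, POS) pairs from per-character joint labels."""
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")
    result: List[Tuple[str, str]] = []
    for unit, label in zip(units, labels):
        boundary, _, pos = label.partition("-")
        if boundary == "B" or not result:
            result.append((unit, pos))      # start a new segment
        else:
            text, seg_pos = result[-1]
            result[-1] = (text + unit, seg_pos)  # extend the current segment
    return result

# Placeholder example: a stem "abc" tagged N followed by a suffix "de" tagged SUF.
example = [("abc", "N"), ("de", "SUF")]
boundary_only = segmentation_labels([seg for seg, _ in example])
unified = joint_labels(example)
restored = decode_joint(list("abcde"), unified)
```

Under this encoding, a single model that predicts the joint labels yields segmentation and POS tags in one pass, whereas the two-step setting would predict the boundary labels first and then tag the resulting segments with a second model.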