Research on Mongolian Word Segmentation and Part-of-Speech Tagging Based on the CRF Model

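ISSN: 1000-5218
Journal: 《内蒙古大学学报：哲学社会科学版》 (Journal of Inner Mongolia University: Philosophy and Social Sciences Edition)
Date: 0
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] International Education College, Hangzhou Normal University, Hangzhou, Zhejiang 311121; [2] Inner Mongolia University Library; [3] School of Mongolian Studies, Inner Mongolia University, Hohhot, Inner Mongolia 010021
Funding: Major Project of the National Social Science Fund of China (grant no. 11&ZD188)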

Keywords: Mongolian word segmentation, Mongolian part-of-speech (POS) tagging, conditional random fields (CRF) model

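Chinese abstract (translated):

To investigate automatic word segmentation and part-of-speech (POS) tagging for Mongolian, the segmentation and POS tagging of a 200,000-word Mongolian corpus are first counted and analyzed, and the segmentation and tagging errors found are corrected in a second pass. The conditional random field (CRF) model is then used to study automatic "word segmentation", "POS tagging", and a "unified implementation" of segmentation and POS tagging. Open-test results show that the accuracy of automatic Mongolian word segmentation is above 98%; that the accuracy of the "unified implementation" experiment is 3.55% higher than that of the "two-step" experiment, in which segmentation and POS tagging are performed separately; and that the "unified implementation" experiment reaches an accuracy of 93.38% once the "context" and the "connected suffix" feature are taken into account. This solves the Mongolian word segmentation and POS tagging problem to a certain extent.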

English abstract:

This paper explores Mongolian word segmentation and POS tagging based on a corpus of 200 thousand Mongolian words. The corpus is first analyzed after manual segmentation and POS tagging. The Conditional Random Fields (CRF) model is then adopted for word segmentation, for POS tagging, and for a unified process of word segmentation and POS tagging, respectively. Findings in the open test show that the precision of word segmentation is more than 98%; the precision of the "unified process" (word segmentation and POS tagging performed together) is 3.55% higher than that of the "two-step" process (word segmentation first, then POS tagging); and the precision of the "unified process" can reach 93.38% when the context and the characteristics of the "agglutinative word-formation suffix" are considered, which to some extent solves the problems of Mongolian word segmentation and POS tagging.
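The difference between the "two-step" and "unified" setups can be made concrete with a small sketch. One common way to treat segmentation and POS tagging as a single sequence-labeling task is to give each character a combined label, such as a boundary marker joined to a POS tag, so that one CRF predicts both at once. The record does not state the label scheme the paper actually used, so the B/I prefixes, the tag names `N` and `SUF`, the sample strings, and the function names `encode_joint` and `decode_joint` below are all assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical joint labeling scheme (not taken from the paper):
# each character gets "<B|I>-<POS>", where B marks the first character
# of a segment and I marks a continuation of the same segment.

def encode_joint(segments: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (segment, pos) pairs into per-character (char, label) pairs."""
    labeled = []
    for text, pos in segments:
        for i, ch in enumerate(text):
            prefix = "B" if i == 0 else "I"
            labeled.append((ch, f"{prefix}-{pos}"))
    return labeled

def decode_joint(labeled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Recover (segment, pos) pairs from per-character joint labels."""
    segments: List[List[str]] = []
    for ch, label in labeled:
        prefix, pos = label.split("-", 1)
        if prefix == "B" or not segments:
            # A B- label (or the very first character) opens a new segment.
            segments.append([ch, pos])
        else:
            # An I- label extends the current segment.
            segments[-1][0] += ch
    return [(text, pos) for text, pos in segments]

# Placeholder romanized strings and tags, not corpus data.
sample = [("nom", "N"), ("un", "SUF")]
labeled = encode_joint(sample)
restored = decode_joint(labeled)
```

In this encoding a sequence labeler trained on the `(char, label)` pairs outputs segment boundaries and POS tags in one pass, whereas a two-step pipeline would first predict only the B/I boundaries and then tag the resulting segments in a second model.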
