Part-of-speech tagging can be approached in many ways. Existing Uyghur part-of-speech taggers are mainly rule-based, and their accuracy is not yet fully satisfactory. Building on a large, manually annotated Uyghur corpus, this paper studies automatic part-of-speech tagging of Uyghur text based on N-gram language models. It analyzes the choice of N-gram model parameters and data smoothing, compares the performance of bigram and trigram models, and examines how the tag set and the size of the training corpus affect tagging accuracy. The experimental results show that the N-gram approach gives good results for Uyghur part-of-speech tagging.
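To make the N-gram idea concrete, the following is a minimal sketch of a bigram tagger with Viterbi decoding. It is not the paper's implementation: the abstract does not say which smoothing method is used, so add-one (Laplace) smoothing is assumed here, and the names `TRAIN`, `train`, `log_prob`, and `viterbi`, along with the tiny English toy corpus, are hypothetical placeholders standing in for a real annotated Uyghur corpus.

```python
import math
from collections import defaultdict

START = "<s>"  # pseudo-tag marking the start of a sentence

# Hypothetical toy corpus; a real experiment would use the annotated Uyghur corpus.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]

def train(tagged_sentences):
    """Collect bigram tag-transition counts and tag-to-word emission counts."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); smoothing choice is an assumption."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # +1 reserves probability mass for unseen words
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {
        t: (log_prob(trans, START, t, n_tags)
            + log_prob(emit, t, words[0], n_words), [t])
        for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            e = log_prob(emit, t, word, n_words)
            score, path = max(
                ((best[p][0] + log_prob(trans, p, t, n_tags) + e, best[p][1])
                 for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

model = train(TRAIN)
print(viterbi(["a", "dog", "sleeps"], model))
```

A trigram variant would condition each transition on the two preceding tags instead of one, which sharpens context at the cost of sparser counts; that trade-off is why smoothing and training-corpus size matter in the comparison the abstract describes.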