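This paper proposes an automatic restoration model for phonetic changes in Uyghur. Unlike previous methods, the model generalizes the phenomenon: it first assumes that every sound in Uyghur undergoes phonetic change, which turns restoration into a problem similar to part-of-speech tagging, and then carries out the restoration with a tagging approach. Experiments were run on the "Uyghur Million-Word Morphological Analysis Corpus" (《维吾尔语百万词词法分析语料库》), manually annotated by the Xinjiang Key Laboratory of Multilingual Information Technology (新疆多语种信息技术重点实验室). Used as one component of a Uyghur morphological analyzer, the restoration module raised the analyzer's F-score from 84.1% to 91.4%. Restoration accuracy for verb stems, which take the most affixes and show the most complex variation in Uyghur, reached 88.6%, which is fully acceptable for practical use.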
We propose an automatic lemmatization model for inflectional phenomena in Uyghur. In contrast to previous methods, we generalize inflection in Uyghur conceptually and treat lemmatization as a sequence tagging problem. Using the "Uyghur million word Part-of-Speech tagging corpus" as the training data, the proposed method improves the F value of lemmatization from 84.1% to 91.4%, and attains an F value of 88.6% for Uyghur verbs, which are rich in suffixes and morphologically complex.
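The idea of treating restoration as tagging can be illustrated with a minimal sketch. It is not the paper's implementation: the tag scheme (a keep tag `=` or the original character), the function names `to_tags` and `apply_tags`, the restriction to same-length substitutions, and the Latin-script example word are all assumptions made for illustration. Every character receives a tag, which mirrors the generalization that every sound is treated as a candidate for change; unchanged characters simply receive the keep tag.

```python
KEEP = "="  # hypothetical tag meaning "this character is unchanged"


def to_tags(surface: str, restored: str) -> list[str]:
    """Turn an aligned (surface, restored) pair into one tag per character.

    Assumption: both forms have the same length, so only substitutions
    are covered; insertions and deletions would need a richer tag set.
    """
    if len(surface) != len(restored):
        raise ValueError("this sketch only handles same-length forms")
    return [KEEP if s == r else r for s, r in zip(surface, restored)]


def apply_tags(surface: str, tags: list[str]) -> str:
    """Rebuild the restored form from a surface form and its tags."""
    if len(surface) != len(tags):
        raise ValueError("expected exactly one tag per character")
    return "".join(s if t == KEEP else t for s, t in zip(surface, tags))


# Illustrative example in Latin transliteration: the stem vowel "a"
# surfaces as "i" before the suffix, and the tag sequence records
# which character to put back.
tags = to_tags("balisi", "balasi")
print(tags)                        # ['=', '=', '=', 'a', '=', '=']
print(apply_tags("balisi", tags))  # balasi
```

In a trained system the tags would be predicted by a sequence labeler from the surface form alone; `to_tags` only shows how training labels could be derived from an annotated corpus under this hypothetical scheme.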