Building on an analysis of existing Tibetan automatic word segmentation methods, this paper examines Tibetan word-formation rules, syntactic structure, the part-of-speech relations between adjacent words, the attachment rules of the suffix letter (R), and the usage of case particles. It focuses on identifying and handling out-of-vocabulary words, contracted words (abbreviations), and overlapping ambiguities, and proposes three methods: the re-combination method, the exclusion-restoration method, and the POS rule method. In tests on Tibetan corpora of size 1M covering literature, poetry, medicine, news, and other genres, the recognition accuracy for out-of-vocabulary words, contracted words, and overlapping ambiguities reached 99.84%, 99.95%, and 92.02%, respectively.