Drawing on research results in corpus-based word extraction for Chinese, this paper presents a preprocessing method for Tibetan text and proposes a corpus-based algorithm for extracting Tibetan high-frequency words. The work comprises a noise-character table for Tibetan text preprocessing, a treatment of contracted words together with their preprocessing method, and the corpus-based high-frequency word extraction algorithm itself. Experimental results show that the algorithm achieves a precision of 86.22%, a recall of 89.79%, and an F-measure of 87.94%.
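As an illustration only, the general idea of corpus-based high-frequency candidate extraction can be sketched as follows. This is not the algorithm evaluated in the paper, whose details are not given here. The sketch assumes that noise characters act as segment boundaries, that syllables are delimited by the tsheg (U+0F0B) and clauses by the shad (U+0F0D), and that candidates are syllable n-grams kept when their count reaches a threshold. The function names and the parameters `noise_chars`, `max_len`, and `min_count` are hypothetical, and contracted words are not handled.

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)
SHAD = "\u0f0d"   # Tibetan clause mark (shad)


def split_syllables(text, noise_chars):
    """Return a list of clauses, each a list of syllables.

    Assumption: every noise character is treated as a clause boundary.
    """
    cleaned = "".join(SHAD if ch in noise_chars else ch for ch in text)
    clauses = []
    for clause in cleaned.split(SHAD):
        syllables = [s for s in clause.strip().split(TSHEG) if s.strip()]
        if syllables:
            clauses.append(syllables)
    return clauses


def high_frequency_candidates(text, noise_chars=frozenset(), max_len=3, min_count=2):
    """Count syllable n-grams (n <= max_len) and keep the frequent ones."""
    counts = Counter()
    for syllables in split_syllables(text, noise_chars):
        for n in range(1, max_len + 1):
            for i in range(len(syllables) - n + 1):
                counts[TSHEG.join(syllables[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Two short clauses that share the syllable sequence "བོད་ཡིག".
    sample = "བོད་ཡིག་སློབ།བོད་ཡིག་ཀློག།"
    for gram, count in sorted(high_frequency_candidates(sample).items()):
        print(gram, count)
```

Raw frequency alone tends to over-generate (every sub-sequence of a frequent word is also frequent), so a practical system would need further filtering; the threshold here is only a placeholder for whatever selection criterion the full method applies.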