停用词的处理是文本挖掘中一个关键的预处理步骤。该文结合现有停用词的处理技术,研究了基于统计的藏文停用词选取方法,通过实验分析了词项频率、文档频率、熵等方法的藏文停用词选用情况,提出了藏文虚词、特殊动词和自动处理方法相结合的藏文停用词选取方法。实验结果表明,该方法可以确定一个较合理的藏文停用词表。
Stop words processing is a key preprocessing step in the text mining. In this paper, the selection method of stop words in Tibetan based on statistics is studied by combining with the existing techniques. Through experiments, TF, DF, and entropy calculation methods in the selection of Tibetan stop words are analyzed. An approach for the selection of Tibetan stop words is presented by the combination of Tibetan function words, special verb and automatic approach. The experimental results show that the proposed method can determine a reasonable Tibetan stop words list.