Research on Tibetan Stop Word Selection and Automatic Processing Methods

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (《中文信息学报》)
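Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031; [2] Department of Computer Science, School of Engineering, Tibet University, Lhasa, Tibet 850000
Funding: National Natural Science Foundation of China (61262058, 60763010); CCF Chinese Information Technology Open Fund (CCF2012-02-01); Ministry of Education "Program for Changjiang Scholars and Innovative Research Team in University", Tibetan information technology (IRT0975).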

Keywords: Tibetan stop word, term frequency (TF), document frequency (DF), entropy

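Chinese abstract (translated):

Stop word processing is a key preprocessing step in text mining. Drawing on existing stop word processing techniques, this paper studies statistics-based methods for selecting Tibetan stop words. Experiments analyze how term frequency, document frequency, and entropy each perform in selecting Tibetan stop words, and the paper proposes a selection method that combines Tibetan function words, special verbs, and automatic processing. The experimental results show that this method can determine a reasonably sound Tibetan stop word list.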

English abstract:

Stop word processing is a key preprocessing step in text mining. Building on existing techniques, this paper studies statistics-based methods for selecting Tibetan stop words. Experiments analyze how term frequency (TF), document frequency (DF), and entropy perform in selecting Tibetan stop words. An approach is then presented that combines Tibetan function words, special verbs, and automatic processing. The experimental results show that the proposed method can determine a reasonable Tibetan stop word list.
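
The three statistics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, thresholds, or combination rule, so everything below is an assumption: TF is taken as the total corpus count of a term, DF as the number of documents containing it, and entropy as the Shannon entropy of the term's occurrences across documents. The function names `stopword_statistics` and `candidate_stopwords`, the `top_k` cutoff, and the intersection of the three rankings are hypothetical choices made for illustration. The input is assumed to be already segmented into words, and Latin placeholder tokens stand in for Tibetan words.

```python
import math
from collections import Counter


def stopword_statistics(docs):
    """Compute TF, DF, and entropy for every term in a tokenized corpus.

    docs: list of documents, each a list of already-segmented tokens.
    These definitions are assumptions, not the paper's stated formulas.
    """
    tf = Counter()   # total occurrences of each term in the corpus
    df = Counter()   # number of documents that contain each term
    per_doc = []     # per-document term counts, reused for entropy
    for doc in docs:
        counts = Counter(doc)
        per_doc.append(counts)
        tf.update(counts)
        df.update(counts.keys())

    # Entropy of a term's distribution over documents: a term spread
    # evenly across many documents scores high, which is typical of
    # stop words; a term concentrated in one document scores 0.
    entropy = {}
    for term, total in tf.items():
        h = 0.0
        for counts in per_doc:
            c = counts.get(term, 0)
            if c:
                p = c / total
                h -= p * math.log2(p)
        entropy[term] = h
    return tf, df, entropy


def candidate_stopwords(docs, top_k):
    """Hypothetical combination rule: keep the terms that rank in the
    top_k by TF, by DF, and by entropy at the same time."""
    tf, df, entropy = stopword_statistics(docs)

    def top(scores):
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return {term for term, _ in ranked[:top_k]}

    return top(tf) & top(df) & top(entropy)


# Placeholder tokens standing in for segmented Tibetan words.
docs = [["a", "b", "a"], ["a", "c"], ["a", "b"]]
tf, df, entropy = stopword_statistics(docs)
candidates = candidate_stopwords(docs, top_k=1)
```

In this toy corpus the token "a" occurs in every document and is the only term that tops all three rankings. A list produced this way would cover only the automatic part of the approach; the abstract describes additionally combining it with Tibetan function words and special verbs, which this sketch does not model.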
