Text Feature Selection Method Based on Term Weight

ISSN: 1000-7024
Journal: Computer Engineering and Design (《计算机工程与设计》)
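Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Department of Mathematics, Tonghua Normal University, Tonghua, Jilin 134002; [2] College of Computer and Information Engineering, Hohai University, Nanjing, Jiangsu 210098; [3] Department of Computer Science, Tonghua Normal University, Tonghua, Jilin 134002
Funding: National Natural Science Foundation of China (60673186).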

Keywords: text categorization, feature selection, term weight, information gain, expected cross entropy, weight of evidence for text

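Abstract (translated from the Chinese):

To improve the efficiency and effectiveness of text categorization and to reduce computational complexity, a weighted text feature selection method is proposed after an analysis of the classical feature selection methods. The method uses not only the number of documents in the dataset but also the term weight information, and constructs new evaluation functions that improve on information gain, expected cross entropy, and weight of evidence for text. A KNN classifier is trained and tested on the Reuters-21578 standard dataset. Experimental results show that the method selects effective features and improves text categorization performance.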

English abstract:

To improve the efficiency and effectiveness of text categorization and to reduce its computational complexity, a text feature selection method with term weight is proposed, building on the classical methods. The method uses not only the number of documents in the dataset but also the term weight information within each text, and from these a new evaluation function is constructed. It works better than information gain, expected cross entropy, and weight of evidence for text. Experiments use a K-nearest neighbor classifier with Reuters-21578 as the standard data collection. Experimental results show that the new method selects good features and effectively improves the performance of text categorization.
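
The abstract does not give the new evaluation function, so the following is only a minimal sketch of the general idea, not the authors' formula. It computes classical information gain from document counts and, as a loudly labeled assumption, a weighted variant in which each document contributes its term weight (normalized by the term's maximum weight) instead of a 0/1 presence flag. The names `information_gain`, `select_features`, and `use_weights` are hypothetical, and documents are assumed to be dictionaries mapping terms to non-negative weights.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(docs, labels, term, use_weights=False):
    """Information gain of `term` with respect to the class labels.

    docs        : list of dicts mapping term -> non-negative weight
    labels      : list of class labels, one per document
    use_weights : False gives the classical document-count estimate;
                  True is an ASSUMED variant (not the paper's function)
                  where a document counts as weight / max weight of the term.
    """
    n = len(docs)
    raw = [doc.get(term, 0.0) for doc in docs]
    max_w = max(raw) if raw else 0.0
    if use_weights and max_w > 0:
        member = [w / max_w for w in raw]               # soft presence in [0, 1]
    else:
        member = [1.0 if w > 0 else 0.0 for w in raw]   # hard presence

    class_total = defaultdict(float)  # documents per class
    class_with = defaultdict(float)   # presence mass per class
    for m, c in zip(member, labels):
        class_total[c] += 1.0
        class_with[c] += m

    mass_with = sum(member)
    mass_without = n - mass_with

    # IG = H(C) - P(t) H(C | t) - P(not t) H(C | not t)
    gain = entropy(class_total[c] / n for c in class_total)
    if mass_with > 0:
        gain -= (mass_with / n) * entropy(
            class_with[c] / mass_with for c in class_total)
    if mass_without > 0:
        gain -= (mass_without / n) * entropy(
            (class_total[c] - class_with[c]) / mass_without for c in class_total)
    return gain


def select_features(docs, labels, k, use_weights=False):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    vocab = sorted({t for doc in docs for t in doc})
    scored = [(information_gain(docs, labels, t, use_weights), t) for t in vocab]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [t for _, t in scored[:k]]
```

In a pipeline like the one the abstract describes, the selected terms would then define the reduced feature space handed to a KNN classifier. The normalization by maximum weight is one arbitrary choice among many; other weighting schemes would change the scores.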
