A Study of the Effects of Term Weighting Factors in Text Categorization

ISSN: 1003-0077
Journal: Journal of Chinese Information Processing (《中文信息学报》)
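Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] Beijing Language and Culture University, Beijing 100083
Funding: National Natural Science Foundation of China (60873166); National 973 Program (2007CB311103); National 863 Program (2006AA010105)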

Keywords: computer application, Chinese information processing, text categorization, term weighting, effects of weighting factors, VSM

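Abstract (translated from the Chinese original):

In traditional vector-space text categorization, term weighting and feature selection are entirely separate processes. The score of a feature selection function reflects a feature's importance, yet it is not incorporated into the weight representation, which makes the feature representation imprecise and hurts classification performance. Some improved methods modify the TFIDF model with feature selection functions and similar measures and do improve classification performance, but they do not investigate how each weighting factor affects that performance. This paper takes term frequency, inverse document frequency, and feature selection functions as the factors measuring a feature's document representativeness, document discrimination, and category discrimination, respectively, and tests their effects on classification performance through experiments. The conclusions are that the document-representativeness factor yields the highest peak classification performance but resists noisy features poorly; the document-discrimination factor resists noise but performs unstably; and the category-discrimination factor resists noise best and performs most stably. Finally, four principles for constructing weight representations are given, and experiments verify that they improve classification performance.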

English abstract:

In traditional vector-space text categorization models, term weighting and feature selection are completely separate. Although feature selection functions assign a score to each term, the score is seldom taken into account when weighting terms. This paper adopts term frequency (TF), inverse document frequency (IDF), and feature selection functions as indicators of a feature's ability to represent a document, to distinguish between documents, and to distinguish between categories, respectively. The experimental results show that TF raises peak performance but is sensitive to noisy features; IDF is robust to noise but unstable; and the feature selection function is strongly noise-tolerant and stable. Finally, four criteria are proposed for combining these factors into optimal weighting schemes, and they are further verified by experiments.
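
To make the three factors concrete, the following is a minimal sketch that computes a TF factor, an IDF factor, and a feature selection score for each term and multiplies them into one weight. It rests on several assumptions that the abstract does not confirm: the chi-square statistic stands in for the unnamed feature selection function, the plain product is only one possible way to combine the factors (the paper's four criteria are not stated here), and the toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized documents with one category label each.
docs = [
    ["goal", "match", "team"],
    ["team", "match", "score"],
    ["stock", "market", "score"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]


def term_frequency(doc):
    """Document representativeness: relative frequency of each term in doc."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}


def inverse_document_frequency(docs):
    """Document discrimination: log(N / df) for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term/category pair (assumed selection function)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def category_score(docs, labels, term):
    """Category discrimination: the term's best chi-square over all categories."""
    return max(chi_square(docs, labels, term, cat) for cat in set(labels))


def weight_document(doc, docs, labels, idf):
    """Assumed combination: weight = TF * IDF * category score."""
    tf = term_frequency(doc)
    return {
        term: tf[term] * idf[term] * category_score(docs, labels, term)
        for term in tf
    }


idf = inverse_document_frequency(docs)
weights = weight_document(docs[1], docs, labels, idf)
```

In this toy corpus, "score" occurs once in each category, so its chi-square score is zero and the product removes it from the representation of `docs[1]`, while "team" and "match", which occur only in sports documents, keep a positive weight. Plain TF-IDF would have given all three terms the same weight, which illustrates the kind of category information the abstract says is left out of the traditional representation.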
