To make text vectors represent textual information accurately and to improve text categorization performance, a term weighting scheme with enhanced category contribution is proposed. A category contribution function for terms is defined from posterior probabilities and combined with the relevance frequency weighting factor, yielding a term weighting scheme that accounts for both the category contribution of a term and the differences in its distribution across categories. Experimental results on four standard corpora show that the proposed scheme is simple to implement, describes the contributions of different features to classification more accurately, improves text representation, and significantly improves categorization performance, outperforming state-of-the-art methods.
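The abstract does not state the weighting formula, so the following Python sketch only illustrates the general idea of combining a relevance frequency factor with a posterior-probability term. The relevance frequency factor uses its commonly cited form, rf = log2(2 + a / max(1, c)), where a and c are the numbers of documents containing the term in the positive and negative categories. The function `posterior`, the function `term_weight`, and the multiplicative form `tf * rf * (1 + P(category | term))` are assumptions made for illustration and should not be read as the proposed scheme itself.

```python
import math


def relevance_frequency(a: int, c: int) -> float:
    """Relevance frequency factor: log2(2 + a / max(1, c)).

    a: documents of the positive category that contain the term
    c: documents of the negative category that contain the term
    """
    return math.log2(2 + a / max(1, c))


def posterior(a: int, c: int) -> float:
    """Estimate P(category | term) from document counts (illustrative assumption).

    Returns 0.0 when the term occurs in no document at all.
    """
    total = a + c
    return a / total if total else 0.0


def term_weight(tf: int, a: int, c: int) -> float:
    """Hypothetical combined weight: tf * rf * (1 + P(category | term)).

    This combination is an assumption for illustration only; the abstract
    does not specify how the two factors are merged.
    """
    return tf * relevance_frequency(a, c) * (1.0 + posterior(a, c))


if __name__ == "__main__":
    # A term seen in 6 positive and 1 negative document, occurring twice in the text.
    print(term_weight(tf=2, a=6, c=1))
    # A term seen only in negative documents receives no category boost.
    print(term_weight(tf=1, a=0, c=5))
```

Under these assumptions, a term concentrated in the positive category receives a larger weight from both factors, whereas a term that appears only in other categories falls back to its plain term frequency.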