This paper proposes an optimized mutual information (MI) method for text feature selection. It addresses the shortcomings of the MI model in three ways: weight factors distinguish positively correlated features from negatively correlated ones; a correction factor introduces term-frequency information into MI to suppress low-frequency words; and features are further weighted according to their position in the text. Together these changes improve the feature selection efficiency of the MI model. Text classification experiments confirm that the proposed optimized MI feature selection method is reasonable and effective.
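The three modifications named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration and not the paper's actual definition: the function names, the sign weights `alpha` and `beta`, the saturating frequency factor with constant `k`, the values in `POSITION_WEIGHTS`, and the 0.5 smoothing term.

```python
import math

# Hypothetical position weights: a title occurrence counts more than a body one.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def mutual_information(a, b, c, n):
    """Classic MI(t, c) estimated from document counts.

    a: documents of the class that contain the term
    b: documents of other classes that contain the term
    c: documents of the class that do not contain the term
    n: total number of documents
    The 0.5 added to `a` is an assumed smoothing term that avoids log(0).
    """
    if a + b == 0 or a + c == 0:
        return 0.0  # term or class never observed: no evidence either way
    return math.log((a + 0.5) * n / ((a + c) * (a + b)))


def weighted_term_frequency(occurrences):
    """Sum the position weight of every occurrence (a list of position labels)."""
    return sum(POSITION_WEIGHTS[position] for position in occurrences)


def improved_mi(a, b, c, n, occurrences, alpha=1.0, beta=0.5, k=5.0):
    """Assumed combination of the three ideas from the abstract.

    1. Sign weight: positively correlated terms get `alpha`, negative ones `beta`.
    2. Frequency correction: wtf / (wtf + k) is close to 0 for rare terms.
    3. Position weighting: wtf counts title occurrences more than body ones.
    """
    mi = mutual_information(a, b, c, n)
    sign_weight = alpha if mi >= 0 else beta
    wtf = weighted_term_frequency(occurrences)
    freq_factor = wtf / (wtf + k)
    return sign_weight * abs(mi) * freq_factor


if __name__ == "__main__":
    # Same document counts, different term frequencies and positions.
    frequent = improved_mi(8, 2, 2, 20, ["title"] + ["body"] * 10)
    rare = improved_mi(8, 2, 2, 20, ["body"])
    print(f"frequent term: {frequent:.3f}, rare term: {rare:.3f}")
```

In this sketch, two terms with identical document counts receive different scores once frequency and position are taken into account, which is the behaviour the abstract attributes to the correction factor and the position weighting. The weights would need to be tuned, or replaced by the paper's own definitions, before any real use.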