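With the growth of the Internet, sentiment classification has attracted close attention from many researchers. To address the imbalanced data distribution and high-dimensional features in sentiment classification, this paper comparatively studies the application of four classic feature selection methods to imbalanced sentiment classification. The paper also proposes three different feature selection modes and experimentally compares their classification and dimensionality-reduction performance. The experimental results show that, in sentiment classification on imbalanced data, feature selection can significantly reduce the dimension of the feature vector without loss of classification performance. In addition, among the feature selection methods, information gain (IG) combined with the "random under-sampling first, then feature selection" mode achieves the best classification performance.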
With the rapid development of the Internet, sentiment classification has attracted great attention from researchers in natural language processing. In this paper, we focus on sentiment classification tasks in which the data distribution is imbalanced (termed imbalanced sentiment classification). To reduce the high-dimensional feature space in imbalanced sentiment classification, we investigate four classic feature selection (FS) methods that have been widely studied in traditional text categorization. Furthermore, three different feature selection modes are proposed and compared on this task. The experimental results demonstrate that feature selection can significantly reduce the dimension of the feature vector without any loss in classification performance. In addition, the results show that the information gain (IG) FS method combined with the mode "Feature selection after random under-sampling" performs best.
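To make the best-performing combination concrete, the following is a minimal sketch of the "feature selection after random under-sampling" mode with information gain as the ranking criterion. It is an illustration under stated assumptions, not the authors' implementation: documents are assumed to be sets of tokens with binary term-presence features, the function names (`random_under_sample`, `information_gain`, `select_features`) are hypothetical, and the toy corpus is invented purely for demonstration.

```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a binary term-presence feature with respect to the class label."""
    n = len(docs)
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_class - h_cond


def random_under_sample(docs, labels, rng):
    """Randomly keep only as many samples per class as the smallest class has."""
    by_class = {}
    for d, y in zip(docs, labels):
        by_class.setdefault(y, []).append(d)
    k = min(len(group) for group in by_class.values())
    out_docs, out_labels = [], []
    for y, group in by_class.items():
        for d in rng.sample(group, k):
            out_docs.append(d)
            out_labels.append(y)
    return out_docs, out_labels


def select_features(docs, labels, top_k, seed=0):
    """Balance the data first, then rank terms by IG on the balanced sample."""
    rng = random.Random(seed)
    bal_docs, bal_labels = random_under_sample(docs, labels, rng)
    vocab = sorted(set().union(*bal_docs))
    ranked = sorted(
        vocab,
        key=lambda t: (-information_gain(bal_docs, bal_labels, t), t),
    )
    return ranked[:top_k]


# Invented toy corpus: six positive reviews, two negative (imbalanced).
docs = [
    {"great", "movie"}, {"great", "plot"}, {"great", "cast"},
    {"great", "story"}, {"great", "acting"}, {"great", "film"},
    {"boring", "movie"}, {"boring", "plot"},
]
labels = ["pos"] * 6 + ["neg"] * 2

selected = select_features(docs, labels, top_k=2)
```

In this sketch the IG scores are computed only on the balanced sample, so the majority class does not dominate the term statistics; the selected terms would then define the reduced feature space used to train the classifier.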