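In recent years, sentiment classification has seen significant development in natural language processing research. However, most existing studies assume that the positive and negative samples involved in classification are equal in number, whereas in practice the distribution of positive and negative data is often imbalanced. This paper collects Chinese review texts from four product domains and finds that positive samples far outnumber negative ones. For Chinese sentiment classification on imbalanced data, we propose an ensemble learning framework based on under-sampling and multiple classification algorithms. Experimental results on four different domains show that our method significantly improves classification performance and clearly outperforms several current mainstream imbalanced classification methods.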
Sentiment classification has undergone significant development in recent years. However, most existing studies assume that negative and positive samples are balanced in number, which may not hold in practice. In this paper, we collect product reviews from four domains and find that positive samples far outnumber negative ones. To handle this imbalance in Chinese sentiment classification, we propose a novel approach that combines sampling and classification algorithms under an ensemble learning framework. Evaluation across different domains shows that the proposed approach performs better than several existing imbalanced classification methods.
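As an illustration of the general idea only, the following minimal sketch combines random under-sampling of the majority class with more than one classification algorithm and merges their predictions by majority vote. The abstract does not name the base classifiers, the number of sampled subsets, or the combination rule, so the Naive Bayes and perceptron learners, the function names (`undersample`, `train_ensemble`, `predict`), the voting rule, and the toy English tokens are all assumptions made for this example, not the method evaluated in the paper.

```python
import math
import random
from collections import Counter, defaultdict

def undersample(pos, neg, rng):
    """Keep all minority samples plus an equal-size random draw from the majority."""
    majority, minority = (pos, neg) if len(pos) >= len(neg) else (neg, pos)
    return rng.sample(majority, len(minority)) + list(minority)

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed base learner)."""
    def fit(self, data):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter()
        for tokens, label in data:
            self.counts[label].update(tokens)
            self.docs[label] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def predict(self, tokens):
        scores = {}
        for label in (0, 1):
            total = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(self.docs[label] / sum(self.docs.values()))
            for t in tokens:
                if t in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.counts[label][t] + 1) / total)
            scores[label] = score
        return max(scores, key=scores.get)

class Perceptron:
    """Bag-of-words perceptron (assumed second base learner)."""
    def __init__(self, epochs=10):
        self.epochs = epochs

    def fit(self, data):
        self.w = defaultdict(float)
        self.b = 0.0
        for _ in range(self.epochs):
            for tokens, label in data:
                if self.predict(tokens) != label:
                    step = 1.0 if label == 1 else -1.0
                    for t in tokens:
                        self.w[t] += step
                    self.b += step
        return self

    def predict(self, tokens):
        score = self.b + sum(self.w.get(t, 0.0) for t in tokens)
        return 1 if score > 0 else 0

def train_ensemble(pos, neg, n_subsets=3, seed=0):
    """Train every classifier type on each balanced, under-sampled subset."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_subsets):
        subset = undersample(pos, neg, rng)
        for make in (NaiveBayes, Perceptron):
            members.append(make().fit(subset))
    return members

def predict(members, tokens):
    """Majority vote; ties resolve to the negative class (label 0)."""
    votes = sum(m.predict(tokens) for m in members)
    return 1 if 2 * votes > len(members) else 0

# Toy imbalanced data: 1 = positive review, 0 = negative review.
pos = [(["good", "quality"], 1), (["great", "phone"], 1), (["good", "service"], 1),
       (["love", "it"], 1), (["great", "value"], 1), (["good", "screen"], 1)]
neg = [(["bad", "quality"], 0), (["poor", "service"], 0)]

members = train_ensemble(pos, neg)
print(predict(members, ["bad", "quality"]))
```

Each base learner sees a balanced subset, so no single model is dominated by the positive class, while repeated sampling lets the ensemble as a whole use more of the majority data than any one subset contains.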