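Based on a study and analysis of sentiment text and n-gram features, this paper proposes a feature-word extraction method based on the chi-square statistic. The method uses n-grams as text features and, building on the traditional chi-square statistic, selects features that either co-occur or occur individually, because co-occurring and individually occurring features may be distributed differently across categories. Redundant n-gram features are then removed by assessing the relevance between multi-word features and categories, yielding n-gram features with high category relevance and low redundancy. The method is tested with an SVM classifier on several corpora, and comparative experiments confirm its effectiveness.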
Sentiment texts are short, carry little information, and yield sparse features. Moreover, the n-gram approach ignores the redundancy and relevance between words. This paper proposes an n-gram feature selection method based on the chi-square statistic. First, each feature is evaluated by taking into account the simultaneous or individual occurrence of features within the feature set, based on the idea that the occurrence of one feature but not the other may also convey valuable information for discrimination. Then, redundancy between words is reduced by using the chi-square statistic to calculate the relevance between features and categories, so that n-gram features with high category relevance and low redundancy can be extracted. Finally, a Support Vector Machine classifier is used to identify text orientation on different corpora; the experimental results show that the method improves the accuracy of text classification.
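As a rough illustration of the baseline that the method builds on, the following sketch scores unigram and bigram features with the standard chi-square statistic over a 2x2 feature/category contingency table and ranks them. It is not the paper's algorithm: the co-occurrence extension and the redundancy-removal step are not specified in the abstract and are omitted here. The function names (`extract_ngrams`, `chi_square`, `rank_ngrams`), the binary 1/0 labels, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=2):
    """Return the set of all 1..max_n-grams found in a token list."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 contingency table.

    a: positive documents containing the feature
    b: negative documents containing the feature
    c: positive documents without the feature
    d: negative documents without the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # feature or class is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def rank_ngrams(docs, labels, max_n=2):
    """Rank n-grams by chi-square score (highest first, ties by name).

    docs: list of token lists; labels: 1 (positive) or 0 (negative).
    This is plain chi-square ranking, an assumed baseline only.
    """
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Document frequency: each n-gram counts once per document.
        target = pos_df if y == 1 else neg_df
        target.update(extract_ngrams(tokens, max_n))
    scores = {}
    for gram in set(pos_df) | set(neg_df):
        a, b = pos_df[gram], neg_df[gram]
        scores[gram] = chi_square(a, b, pos_total - a, neg_total - b)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical data, for illustration only).
docs = [["very", "good"], ["good", "film"], ["not", "good"], ["not", "nice"]]
labels = [1, 1, 0, 0]
ranked = rank_ngrams(docs, labels)
```

In this toy corpus the unigram "not" appears in every negative document and in no positive one, so it receives the highest score. The top-ranked features could then be used to build the vectors fed to an SVM classifier, as the abstract describes.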