Text feature selection is the most important step in automatic text categorization. To better address the high dimensionality of the feature space and the sparsity of document vectors in Uyghur text categorization, this paper proposes a Uyghur text feature selection method based on the class distribution difference of features and information entropy. The method considers not only the distribution of a feature across classes (inter-class) but also its distribution within each class (intra-class). Categorization experiments were conducted on a Uyghur text corpus using the proposed method, and the results were compared with those of several traditional feature selection methods. The results show that, with fewer selected features, the proposed method achieves a higher Macro-F1 value than the other methods, reaching 85.3%, which is 4.3% and 6.1% higher than the traditional IG and CHI methods, respectively.
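To illustrate the general idea of combining inter-class and intra-class distribution through entropy, here is a minimal Python sketch. It is not the scoring formula of the proposed method, which this abstract does not state. The function names `entropy` and `term_scores`, the normalization by the logarithm of the class or document count, and the product used to combine the two factors are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def term_scores(docs):
    """Score each term; docs is a list of (class_label, list_of_tokens).

    Hypothetical score = inter-class factor * intra-class factor:
      inter: 1 - normalized entropy of the term's counts over classes
             (high when the term is concentrated in few classes);
      intra: normalized entropy of the term's counts over the documents
             of its dominant class (high when it is spread evenly there).
    """
    # tf[term][label][doc_index] = occurrences of the term in that document
    tf = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    docs_per_class = defaultdict(int)
    for i, (label, tokens) in enumerate(docs):
        docs_per_class[label] += 1
        for tok in tokens:
            tf[tok][label][i] += 1

    n_classes = len(docs_per_class)
    scores = {}
    for term, by_class in tf.items():
        class_totals = {c: sum(d.values()) for c, d in by_class.items()}
        if n_classes > 1:
            inter = 1.0 - entropy(list(class_totals.values())) / math.log(n_classes)
        else:
            inter = 0.0
        # Dominant class: the class in which the term occurs most often.
        best = max(class_totals, key=class_totals.get)
        n_docs = docs_per_class[best]
        if n_docs > 1:
            intra = entropy(list(by_class[best].values())) / math.log(n_docs)
        else:
            intra = 1.0
        scores[term] = inter * intra
    return scores


if __name__ == "__main__":
    toy_docs = [
        ("sport", ["goal", "match", "the"]),
        ("sport", ["goal", "team", "the"]),
        ("news", ["vote", "the"]),
        ("news", ["vote", "law", "the"]),
    ]
    ranked = sorted(term_scores(toy_docs).items(), key=lambda kv: -kv[1])
    for term, score in ranked:
        print(f"{term}\t{score:.3f}")
```

Under these assumptions, a term that appears in every class with equal frequency scores near zero, while a term confined to one class and spread evenly over that class's documents scores near one. Selecting features then amounts to keeping the top-ranked terms.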