Text categorization is characterized by a high-dimensional feature space and a high level of feature redundancy. To address these two characteristics, the χ² statistic is used to handle the high-dimensional feature space, and the idea of information novelty is used to handle the high feature redundancy. Combining the two according to the definition of maximal marginal relevance, a feature selection method based on maximal marginal relevance is proposed, which can eliminate a large number of redundant features during feature selection. Finally, experiments are conducted on two text data sets, Reuters-21578 Top10 and OHSCAL. The results show that the feature selection method based on maximal marginal relevance is more efficient than the χ² statistic and information gain, and that it improves the performance of three different classifiers: naive Bayes, Rocchio, and kNN.
With the rapid growth of textual information on the Internet, text categorization has become one of the key research directions in data mining. Text categorization is a supervised learning process, defined as automatically assigning free text to one or more predefined categories. At present, text categorization is essential for managing textual information and has been applied in many fields. However, text categorization has two characteristics: high dimensionality of the feature space and a high level of feature redundancy. To address these two characteristics, the χ² statistic is used to deal with the high dimensionality of the feature space, and information novelty is used to deal with the high level of feature redundancy. According to the definition of maximal marginal relevance, a feature selection method based on maximal marginal relevance is proposed, which can reduce redundancy between features during feature selection. Furthermore, experiments are carried out on two text data sets, Reuters-21578 Top10 and OHSCAL. The results indicate that the feature selection method based on maximal marginal relevance is more efficient than the χ² statistic and information gain. Moreover, it can improve the performance of three different categorizers: naive Bayes, Rocchio, and kNN.
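To make the selection criterion concrete, the following is a minimal sketch of greedy maximal-marginal-relevance feature selection. It uses the χ² statistic as the relevance term and, as a stand-in for the paper's information-novelty measure, cosine similarity between feature occurrence vectors as the redundancy term; the function name, the trade-off parameter lam, and the dense non-negative document-term matrix X are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
from sklearn.feature_selection import chi2

def mmr_feature_selection(X, y, k, lam=0.7):
    """Greedy MMR-style feature selection (sketch).

    Relevance: chi-square score of each feature w.r.t. the class labels.
    Redundancy: cosine similarity to already-selected features (an
    illustrative substitute for the information-novelty measure).
    X is assumed to be a dense, non-negative document-term matrix.
    """
    relevance, _ = chi2(X, y)                         # per-feature chi^2 scores
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    selected = [int(np.argmax(relevance))]            # start from the most relevant feature
    candidates = [j for j in range(X.shape[1]) if j != selected[0]]
    while len(selected) < k and candidates:
        sims = Xn[:, candidates].T @ Xn[:, selected]  # |candidates| x |selected| similarities
        redundancy = sims.max(axis=1)                 # worst-case overlap with selected set
        mmr = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

In this sketch, larger lam favors category relevance while smaller lam favors novelty; calling mmr_feature_selection(X, y, k=500) would return the indices of 500 features to keep before training a classifier such as naive Bayes, Rocchio, or kNN.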