Text classification is a foundation of information retrieval and data mining, and it is widely used in web data mining and search engines. In this work, texts are first segmented into words. The segmentation results are then reduced in dimension with two methods, the χ² statistic (CHI) and the correlation coefficient (CC), and features are extracted using the idea of dimension regulation. Once the feature set is obtained, a covering algorithm is trained on it as the text classifier. Experimental results show that combining the correlation coefficient method, the covering algorithm, and dimension regulation yields a text classifier with good performance.
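As an illustration of the two term-scoring methods named in the abstract, the following minimal sketch computes the χ² statistic and the correlation coefficient for a term and a class from a 2x2 document-count table. It uses the formulations commonly given in the text categorization literature, which may differ in detail from the ones used in this work. The function names, the count layout `(a, b, c, d)`, and the sample counts are assumptions made for illustration only, and the dimension regulation step and the covering algorithm are not shown.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square (CHI) score of a term for a class.

    Assumed count layout (hypothetical, for illustration):
      a: documents in the class that contain the term
      b: documents outside the class that contain the term
      c: documents in the class that do not contain the term
      d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class is constant
    return n * (a * d - c * b) ** 2 / denom


def correlation_coefficient(a, b, c, d):
    """Correlation coefficient (CC) score: a signed square root of CHI.

    Positive values mean the term indicates membership in the class,
    negative values mean it indicates non-membership.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)


def top_k_terms(counts, k, score=correlation_coefficient):
    """Keep the k highest-scoring terms; counts maps term -> (a, b, c, d)."""
    ranked = sorted(counts, key=lambda t: score(*counts[t]), reverse=True)
    return ranked[:k]


# Made-up counts for a hypothetical "sports" class over 100 documents.
example_counts = {
    "goal": (40, 10, 10, 40),   # mostly appears inside the class
    "the": (25, 25, 25, 25),    # evenly spread, carries no signal
    "stock": (10, 40, 40, 10),  # mostly appears outside the class
}
selected = top_k_terms(example_counts, k=1)
```

Because CHI squares the difference `a * d - c * b`, it gives "goal" and "stock" the same score, while CC keeps the sign and ranks only "goal" highly. This difference is one plausible reason for preferring CC when selecting features that positively indicate a class.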