Text classification is one of the key topics in information retrieval and data mining. This paper presents a text classification method based on the greedy cover algorithm (GCA). Texts are first segmented into words; features are then extracted from the segmentation results using the CHI statistic, and feature weights are computed with TF-IDF-ICSD. The classifier is built with the GCA, using an alternative method for selecting the initial point. The method is evaluated on a test set drawn from the Fudan University corpus and compared with the BP neural network algorithm. The experimental results show that the proposed method is feasible and effective.
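To make the feature extraction step concrete, the following is a minimal sketch of CHI-based feature selection on already segmented documents. The abstract does not give the formula, so the sketch assumes the standard CHI statistic, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), where A, B, C and D are the document counts of the term/category contingency table. The function names `chi_square` and `select_features`, the toy corpus, and the choice of scoring each term by its maximum CHI value over all categories are illustrative assumptions, not the authors' implementation.

```python
def chi_square(docs, labels, term, category):
    """Standard CHI statistic of `term` with respect to `category`.

    docs   -- list of token lists (documents that are already segmented)
    labels -- list of category labels, one per document
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0          # degenerate table: term or category is constant
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest CHI score.

    Assumption: a term's score is its maximum CHI value over all categories.
    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = sorted({t for doc in docs for t in doc})
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, used only to exercise the functions.
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["stock", "market"],
    ["stock", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

print(select_features(docs, labels, 2))  # ['goal', 'stock']
```

In this toy corpus, "team" occurs equally often in both categories and therefore scores 0, while "goal" and "stock" each occur in only one category and receive the highest score. The selected terms would then be weighted (with TF-IDF-ICSD in the paper) before the classifier is trained.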