To improve the performance of Chinese text categorization, a categorization method based on an evolutionary hypernetwork is proposed. The Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used to segment the text, and the nouns, verbs, and adjectives are retained as candidate features. Features are selected with the χ² statistic, and feature weights are computed by Boolean weighting. The resulting feature vectors serve as the training and test sets. A hyperedge replacement strategy is used to train the hypernetwork classification model, which then classifies the test-set feature vectors. The performance of evolutionary hypernetwork models of different orders is analyzed and compared with that of the traditional KNN and SVM algorithms. Experimental results show that, on the Fudan University corpus and the Sohu corpus respectively, the proposed method achieves macro precision of 87.2% and 72.5%, macro recall of 86.9% and 70.5%, and macro F1 of 87.0% and 71.5%, which is close to or better than KNN and SVM. The proposed method is therefore an effective approach to Chinese text classification.
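The feature-selection and weighting steps named in the abstract can be illustrated with a minimal sketch. It assumes documents are already segmented into tokens (for example by ICTCLAS) and filtered to nouns, verbs, and adjectives. The function names `chi_square`, `select_features`, and `boolean_vector` are hypothetical, and scoring each term by its maximum χ² over the classes is an assumption, since the abstract does not say how per-class scores are combined.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one (term, class) 2x2 contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score.

    Assumption: a term's score is its maximum chi-square over all classes.
    Ties are broken by the term string so the result is deterministic.
    """
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]
    class_sizes = Counter(labels)
    df = Counter(t for s in doc_sets for t in s)
    df_in_class = Counter((t, c) for s, c in zip(doc_sets, labels) for t in s)

    scores = {}
    for term, term_df in df.items():
        best = 0.0
        for c, size in class_sizes.items():
            n11 = df_in_class[(term, c)]
            n10 = term_df - n11
            n01 = size - n11
            n00 = n_docs - size - n10
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def boolean_vector(doc, features):
    """Boolean weighting: 1 if the feature occurs in the document, else 0."""
    present = set(doc)
    return [1 if f in present else 0 for f in features]


# Toy corpus invented for illustration only; it is not from either corpus.
docs = [
    ["股票", "市场", "上涨"],
    ["股票", "基金", "市场"],
    ["球队", "比赛", "市场"],
    ["球队", "进球", "比赛"],
]
labels = ["finance", "finance", "sports", "sports"]

features = select_features(docs, labels, k=3)
vectors = [boolean_vector(d, features) for d in docs]
```

On this toy corpus the term "市场" appears in both classes and therefore scores lower than terms confined to one class, which is the behaviour χ² selection is meant to provide. The Boolean vectors produced this way are the kind of input a classifier such as the hypernetwork, KNN, or SVM would then be trained on.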