Discrete text has become a major form in which public-opinion information appears. Based on the characteristics of discrete text, this paper proposes a classification and clustering framework for discrete-text public-opinion information built on a characteristic concept network, and gives a classification and clustering scheme on that basis. The clustering algorithm combines the global parallel search ability of genetic algorithms, the efficient local clustering ability of k-means, and the ability of the niche method to maintain population diversity and suppress drift. The classification algorithm first clusters the training text library within each class into sub-classes, builds a characteristic concept network for each sub-class to generate a text that stands in for that sub-class, and then classifies new texts with the KNN algorithm. Finally, practical improvements are proposed in light of public-opinion analysis.
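The classification scheme described above (cluster each class into sub-classes, replace each sub-class with one representative, then run KNN over the representatives) can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: texts are already encoded as numeric feature vectors, plain k-means does the within-class clustering, and the sub-class centroid stands in for the text generated from the characteristic concept network, whose construction the abstract does not specify. The function names kmeans, build_prototypes and knn_predict are hypothetical.

```python
import numpy as np
from collections import Counter


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def build_prototypes(X, y, subclasses_per_class=2, seed=0):
    """Cluster each class into sub-classes and keep one vector per sub-class.

    Assumption: the centroid is a stand-in for the representative text that
    the paper derives from a characteristic concept network.
    """
    protos, labels = [], []
    for label in sorted(set(y.tolist())):
        rows = X[y == label]
        k = min(subclasses_per_class, len(rows))
        for centroid in kmeans(rows, k, seed=seed):
            protos.append(centroid)
            labels.append(label)
    return np.array(protos), np.array(labels)


def knn_predict(protos, labels, x, k=1):
    """Majority vote among the k nearest sub-class representatives."""
    dists = ((protos - x) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]
```

Because KNN then compares a new text against a few representatives per class instead of every training text, the number of distance computations per query drops from the size of the training library to the total number of sub-classes.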