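To handle high-dimensional, sparse, large-scale document data, a semi-supervised text clustering algorithm based on strong category feature affinity propagation (SCFAP) is proposed. During clustering, a small amount of labeled supervision data is used to extract feature terms with strong category-discriminating ability, so as to build a more effective similarity measure between samples. After each round of iteration, the unlabeled samples with the highest degree of category certainty are transferred to the labeled set, which improves the execution efficiency of the algorithm. Experimental results show that these improvements substantially help the performance and accuracy of the affinity propagation algorithm, and that SCFAP shows good applicability on two different datasets, Reuters-21578 and 20 Newsgroups. Considering both the micro-averaged clustering index P and the cluster purity index R, the algorithm quickly obtains good clustering results with the aid of a small amount of supervision information.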
A semi-supervised text clustering algorithm based on strong classification features affinity propagation (SCFAP) is proposed to handle sparse, large-scale, high-dimensional document data. During clustering, a small number of labeled samples are used to extract strong classification features, from which a more effective similarity measure is constructed. Moreover, to improve the execution efficiency of the algorithm, the unlabeled documents with maximum category certainty are transferred from the unlabeled collection to the labeled collection after each round of iteration. The experimental results show that these improvements substantially raise the performance and accuracy of the classical affinity propagation algorithm. The SCFAP algorithm shows good applicability on both Reuters-21578 and 20 Newsgroups. Judged jointly by the micro-average F index and the clustering purity index, the SCFAP-based semi-supervised text clustering algorithm obtains good clustering results quickly.
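The abstract does not state the feature-scoring criterion or how category certainty is measured, so the following Python fragment is only a minimal sketch under explicit assumptions, not the authors' implementation. It assumes that a term is "strong" when its labeled occurrences are frequent and concentrated in one category, that the similarity measure is a cosine similarity with strong terms up-weighted, and that certainty is the margin between the best and second-best mean similarity to the labeled categories. The names `strong_features`, `similarity`, `transfer_most_certain`, `top_k`, and `boost` are hypothetical. The affinity propagation message passing itself is omitted; the similarity computed here is what would be fed into it.

```python
import math
from collections import Counter, defaultdict


def strong_features(labeled, top_k=2):
    """Return the set of terms judged strongly category-specific.

    ASSUMPTION: score = (count in category)^2 / (count over all labeled docs),
    i.e. in-category frequency times in-category concentration. The paper's
    actual criterion is not given in the abstract.
    """
    per_class = defaultdict(Counter)
    total = Counter()
    for tokens, label in labeled:
        per_class[label].update(tokens)
        total.update(tokens)
    strong = set()
    for counts in per_class.values():
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(counts, key=lambda t: (-(counts[t] ** 2 / total[t]), t))
        strong.update(ranked[:top_k])
    return strong


def similarity(a, b, strong, boost=3.0):
    """Weighted cosine similarity; strong terms count `boost` times as much."""
    va, vb = Counter(a), Counter(b)

    def weight(term):
        return boost if term in strong else 1.0

    dot = sum(weight(t) * va[t] * vb[t] for t in va if t in vb)
    norm_a = math.sqrt(sum(weight(t) * c * c for t, c in va.items()))
    norm_b = math.sqrt(sum(weight(t) * c * c for t, c in vb.items()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def transfer_most_certain(labeled, unlabeled, strong):
    """Move the most certain unlabeled document into `labeled` (in place).

    ASSUMPTION: certainty is the gap between the best and second-best mean
    similarity to the labeled documents of each category.
    Returns (document, assigned_label), or None if nothing can be moved.
    """
    best = None  # (margin, index in unlabeled, predicted label)
    for i, doc in enumerate(unlabeled):
        sims = defaultdict(list)
        for tokens, label in labeled:
            sims[label].append(similarity(doc, tokens, strong))
        ranked = sorted(
            ((sum(v) / len(v), lab) for lab, v in sims.items()), reverse=True
        )
        if not ranked:
            continue
        runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
        margin = ranked[0][0] - runner_up
        if best is None or margin > best[0]:
            best = (margin, i, ranked[0][1])
    if best is None:
        return None
    _, index, label = best
    doc = unlabeled.pop(index)
    labeled.append((doc, label))
    return doc, label
```

In a full loop, one would presumably recompute the strong features and the similarity matrix after each transfer, so that the growing labeled set sharpens the measure used in the next clustering round.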