This paper analyzes three shortcomings of the basic suffix tree clustering (STC) algorithm: it cannot effectively handle nodes whose document sets differ greatly in size but stand in an inclusion relation; it cannot effectively handle nodes whose document sets are similar but whose topics differ; and it lacks an effective algorithm for extracting cluster labels. To address these problems, an improved similarity formula for base cluster merging is given that takes into account both topic similarity and the similarity of the contained documents, and a cluster label extraction algorithm based on information gain is proposed. To further improve clustering efficiency, a simple and effective measure for base cluster selection is given, which excludes generalized suffix tree nodes that are meaningless for the clustering. Experimental results show that the proposed algorithm not only improves the clustering accuracy of STC but also labels the resulting clusters effectively.
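The abstract does not state the improved similarity formula or the exact labeling procedure, so the following is only a minimal sketch of the two standard building blocks it refers to. The first function is the merge rule commonly attributed to the original STC algorithm (two base clusters are merged when their overlap exceeds half of each cluster). The second is the textbook information gain of "document contains a phrase" with respect to "document belongs to a cluster". The function names, the 0.5 default threshold, and the toy document IDs are illustrative assumptions, not the paper's method.

```python
import math


def stc_similar(a: set, b: set, threshold: float = 0.5) -> bool:
    """Baseline STC merge rule (assumed form): overlap must exceed
    `threshold` of BOTH base clusters. Both sets must be non-empty."""
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold


def entropy(pos: int, total: int) -> float:
    """Binary entropy (in bits) of a split with `pos` positives out of `total`."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs_with_phrase: set, cluster: set, all_docs: set) -> float:
    """Standard information gain of phrase presence about cluster membership.
    How the paper uses this score to pick labels is not given in the abstract."""
    n = len(all_docs)
    with_p = docs_with_phrase & all_docs
    without_p = all_docs - with_p
    h_before = entropy(len(cluster & all_docs), n)
    h_after = (
        len(with_p) / n * entropy(len(cluster & with_p), len(with_p))
        + len(without_p) / n * entropy(len(cluster & without_p), len(without_p))
    )
    return h_before - h_after


# First shortcoming: a small base cluster fully contained in a large one
# is not merged, because the overlap is only 3/20 of the large cluster.
big = set(range(20))
small = {0, 1, 2}
contained_not_merged = not stc_similar(big, small)

# Two base clusters of comparable size with a large overlap are merged.
merged = stc_similar(set(range(10)), set(range(4, 13)))

# A phrase that occurs in exactly the cluster's documents is maximally informative;
# a phrase spread evenly inside and outside the cluster carries no information.
docs = set(range(8))
cluster = {0, 1, 2, 3}
ig_perfect = information_gain({0, 1, 2, 3}, cluster, docs)
ig_useless = information_gain({0, 1, 4, 5}, cluster, docs)
```

In this toy setup the contained cluster fails the baseline rule even though every one of its documents lies in the larger cluster, which is the kind of case the improved similarity formula is meant to handle.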