To address the sensitivity of the k-means algorithm to its initial centers and its tendency to fall into local optima, a text clustering method based on part of speech and improved center selection (STICS) is proposed. STICS improves on k-means in three respects: the semantic representation of text, the selection of initial centers, and the treatment of isolated points (outliers). Because features with different parts of speech contribute differently to a text, STICS represents each text with a weighted vector space model (VSM). For center selection, the sample average similarity of every sample is measured, and the sample with the largest sample average similarity is selected as the first cluster center. STICS also eliminates the negative influence of isolated points, which further improves the clustering. Experimental results show that the proposed method achieves better clustering results.
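The first-center selection described above can be illustrated with a minimal sketch. This is not the authors' implementation: the part-of-speech weights in `POS_WEIGHTS`, the use of cosine similarity as the similarity measure, and all function names are assumptions made for illustration, since the abstract gives none of these details.

```python
import math

# Hypothetical part-of-speech weights; the abstract does not give actual values.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.6, "adj": 0.4}
DEFAULT_WEIGHT = 0.2  # assumed weight for any other part of speech


def weighted_vector(tagged_tokens, vocabulary):
    """Build a POS-weighted term-frequency vector over a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for term, pos in tagged_tokens:
        if term in index:
            vec[index[term]] += POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
    return vec


def cosine(u, v):
    """Cosine similarity (assumed measure); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarities(vectors):
    """Mean similarity of each sample to all other samples."""
    n = len(vectors)
    if n < 2:
        return [0.0] * n
    return [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def first_center(vectors):
    """Index of the sample with the largest sample average similarity."""
    avg = average_similarities(vectors)
    return max(range(len(vectors)), key=avg.__getitem__)


# Toy corpus of (term, part-of-speech) pairs, invented for this example.
vocabulary = ["cluster", "text", "run", "fast"]
docs = [
    [("cluster", "noun"), ("text", "noun")],
    [("cluster", "noun"), ("text", "noun"), ("run", "verb")],
    [("run", "verb"), ("fast", "adj")],
]
vectors = [weighted_vector(d, vocabulary) for d in docs]
print(first_center(vectors))  # index of the first cluster center
```

In this toy corpus the second document shares terms with both of the others, so it has the largest average similarity and is picked as the first center. The third document has the lowest average similarity; how STICS actually identifies and handles such isolated points is not specified in the abstract.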