This paper proposes a text clustering algorithm based on density peak discovery. The algorithm recasts the distance and density computations between texts as similarity computations between text vectors. Documents are represented with the vector space model (VSM), and their pairwise similarity is measured with the cosine formula. From these similarities, the algorithm obtains the local density of each document and its distance to the nearest document of higher density. After noise points are removed, cluster centers are selected, and each remaining non-center point is assigned to the cluster of its nearest cluster center. Several sets of comparative experiments verify the reliability and robustness of the method.
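The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details that the abstract does not specify are assumptions made here for illustration: documents are tokenized on whitespace and weighted by raw term frequency, distance is taken as one minus cosine similarity, local density is the number of documents within a cutoff distance `d_c`, noise points are those whose density falls below `min_density`, and cluster centers are the non-noise documents with the largest product of density and distance. The function names and parameters are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def density_peak_text_clustering(docs, n_clusters, d_c=0.5, min_density=1):
    """Return one cluster label per document; -1 marks a noise point.

    d_c and min_density are assumed tuning parameters, not values
    taken from the paper.
    """
    # Vector space model: one term-frequency vector per document
    # (assumption: whitespace tokens, raw counts).
    vectors = [Counter(doc.lower().split()) for doc in docs]
    n = len(vectors)

    # Pairwise distance derived from cosine similarity.
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine_similarity(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d

    # Local density: number of other documents closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Distance to the nearest document of higher density. Ties in density
    # are broken by index; the densest document gets its largest distance.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])

    # Remove noise points (assumed criterion: density below min_density).
    candidates = [i for i in range(n) if rho[i] >= min_density]

    # Select cluster centers: largest rho * delta among non-noise points.
    centers = sorted(candidates, key=lambda i: (-rho[i] * delta[i], i))
    centers = centers[:n_clusters]

    # Assign every remaining point to the cluster of its nearest center.
    labels = [-1] * n
    for i in candidates:
        labels[i] = min(range(len(centers)),
                        key=lambda c: dist[i][centers[c]])
    return labels
```

Called on a small corpus with two obvious topics and one unrelated document, and with `n_clusters=2`, this sketch groups the documents of each topic under one label and marks the unrelated document as noise with the label `-1`. In practice the cutoff `d_c` and the noise threshold would need tuning for the corpus at hand.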