To address the shortcomings of the existing K-Means algorithm, namely that the number of clusters K must be set manually, the initial centers are selected at random, and the text representation is high-dimensional and lacks semantic information, a new K-Means algorithm based on concept lattices, K-MeansBCC (K-Means algorithm based on concept lattice), is proposed. The text collection is first preprocessed and converted into a formal context, from which a concept lattice is generated. Texts are then represented by the concepts in the lattice, and the value of K and the initial centers are determined according to the weights of the concepts in the texts. Finally, a formula for the concept similarity between texts is designed, and the K-Means algorithm produces the clustering result. Experimental results show that the algorithm improves the efficiency and accuracy of clustering.
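As an illustration of the first two steps (formal context, then concepts), the following minimal sketch builds a tiny formal context from term sets and enumerates its formal concepts using the standard derivation operators of formal concept analysis. The toy corpus and the names `docs`, `common_terms`, `docs_having`, `formal_concepts`, and `concept_profile` are assumptions made for this example. The concept weighting, the selection of K and the initial centers, and the concept similarity formula are not given in this abstract, so they are not reproduced here.

```python
from itertools import combinations

# Toy formal context: documents (objects) mapped to their index terms
# (attributes). Hypothetical data; real input would come from preprocessing.
docs = {
    "d1": {"cluster", "kmeans"},
    "d2": {"cluster", "lattice"},
    "d3": {"cluster", "kmeans", "lattice"},
}

def common_terms(objects, context):
    """Terms shared by every document in `objects` (derivation A -> A')."""
    shared = set().union(*context.values())
    for name in objects:
        shared &= context[name]
    return shared

def docs_having(terms, context):
    """Documents that contain every term in `terms` (derivation B -> B')."""
    return {name for name, attrs in context.items() if terms <= attrs}

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    This brute-force scan is exponential in the number of documents,
    so it is only suitable for toy inputs.
    """
    names = sorted(context)
    concepts = set()
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset, context)
            extent = docs_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

def concept_profile(doc, concepts):
    """Represent a document by the concepts whose extent contains it."""
    return {c for c in concepts if doc in c[0]}

concepts = formal_concepts(docs)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Representing each document by `concept_profile` rather than by raw terms is one simple way to read "texts are represented by the concepts in the lattice"; a practical implementation would replace the brute-force enumeration with a dedicated lattice construction algorithm.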