Most Chinese text clustering algorithms based on the vector space model suffer from high-dimensional, sparse representations, ignore the semantic relations between words, and provide no description of the resulting clusters. To address these problems, this paper proposes CTCAUSL (Chinese text clustering algorithm using semantic list), a Chinese text clustering algorithm based on semantic lists. The algorithm represents each text as a semantic list whose entries are the words that occur in that text, which reduces the dimensionality of the data and avoids sparsity. It uses word-to-word similarity computation to handle synonyms and near-synonyms. Finally, it describes each cluster with a semantic list, which makes the clustering results more readable. Experimental results show that CTCAUSL performs well on large volumes of text data and clearly improves the accuracy of Chinese text clustering.
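The abstract does not specify how semantic lists are compared or how clusters are formed, so the following Python sketch is only an illustration of the general idea, not the CTCAUSL procedure itself. It assumes that a semantic list is simply the distinct words of a text, that word similarity comes from a small hypothetical lookup table (`TOY_SIMILARITY`, standing in for whatever lexical resource the paper uses), that two lists are compared by averaging each word's best match in the other list, and that clustering is a simple single-pass threshold scheme. All function names and the threshold value are assumptions made for this example.

```python
from collections import Counter

# Hypothetical synonym scores; a real system would query a lexical resource.
TOY_SIMILARITY = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("doctor", "physician")): 0.9,
}


def word_similarity(a, b):
    """Return 1.0 for identical words, a table score for known pairs, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def semantic_list(tokens):
    """Keep only the distinct words that occur in the text, in first-seen order."""
    return list(dict.fromkeys(tokens))


def list_similarity(xs, ys):
    """Symmetric average of each word's best match in the other list."""
    if not xs or not ys:
        return 0.0
    forward = sum(max(word_similarity(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(max(word_similarity(y, x) for x in xs) for y in ys) / len(ys)
    return (forward + backward) / 2


def cluster(docs, threshold=0.5):
    """Single-pass clustering (an assumed scheme): join the most similar
    cluster if its score reaches the threshold, otherwise start a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best, best_score = None, threshold
        for members in clusters:
            score = max(list_similarity(doc, docs[j]) for j in members)
            if score >= best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def describe(members, docs, top_n=3):
    """Describe a cluster by the words that occur in the most member lists."""
    counts = Counter(word for j in members for word in docs[j])
    return [word for word, _ in counts.most_common(top_n)]


# Toy English tokens are used for readability; segmented Chinese words
# would be handled the same way.
docs = [semantic_list(t) for t in (
    ["car", "engine", "repair", "car"],
    ["automobile", "engine", "repair"],
    ["doctor", "hospital", "patient"],
    ["physician", "hospital", "patient"],
)]
clusters = cluster(docs)
for members in clusters:
    print(members, describe(members, docs, top_n=2))
```

In this toy run the first two texts share no identical word for "car" and "automobile", yet the similarity table lets them fall into the same cluster, which is the synonym handling the abstract refers to. Each list holds only the words of its own text, so no vocabulary-sized sparse vector is ever built.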