A Chinese Text Clustering Algorithm Using Semantic Lists

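ISSN: 1001-3695
Journal: 《计算机应用研究》 (Application Research of Computers)
Date: 0
Classification: TP311 [Automation and Computer Technology - Computer Software and Theory; Automation and Computer Technology - Computer Science and Technology]
Author affiliation: [1] School of Computer Science and Telecommunications Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013
Funding: National Natural Science Foundation of China (60841003); National Torch Program (2004EB33006)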

Keywords: text clustering, text representation, semantic list, similarity calculation, cluster representation

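Chinese abstract:

Most Chinese text clustering algorithms based on the vector space model suffer from high-dimensional, sparse data, ignore the semantic links between words, and provide no description of the resulting clusters. To address these problems, a Chinese text clustering algorithm using semantic lists, CTCAUSL (Chinese text clustering algorithm using semantic list), is proposed. The algorithm represents each text as a semantic list. Because the words in a text's semantic list are exactly the words that occur in that text, the data dimensionality is reduced and the sparsity problem does not arise. Similarity calculation between words handles synonyms and near-synonyms. Finally, each cluster is described by a semantic list, which makes the clustering results more readable. Experimental results show that CTCAUSL performs well on large volumes of text data and noticeably improves the accuracy of Chinese text clustering.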

English abstract:

Common Chinese document clustering algorithms rely on the vector space model and therefore suffer from high-dimensional, sparse text features, ignore the semantic relations among words, and lack cluster descriptions. To solve these problems, this paper proposes a Chinese text clustering algorithm using semantic lists (CTCAUSL). The algorithm represents each document as a semantic list. The words in a document's semantic list are exactly those that occur in the document, which reduces the dimensionality and avoids a sparse space. The method also uses similarity calculation between words to handle synonyms and near-synonyms. Finally, to improve the readability of the clustering results, it describes each cluster by a semantic list. The experimental results indicate that CTCAUSL performs well on large amounts of document data and significantly improves the accuracy of Chinese text clustering.
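The abstract names three ingredients: a semantic-list representation, word-level similarity, and cluster descriptions. It does not give the formulas or the clustering procedure. The Python sketch below is one possible reading and not the published CTCAUSL algorithm. Everything in it is an assumption made for illustration: the toy `SYNONYMS` table stands in for whatever word-similarity resource the paper uses, `list_sim` is a simple symmetric best-match average, `cluster` is a single-pass threshold scheme, and the input is assumed to be already segmented into words.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy synonym table; a real system would query a lexical
# resource or a learned similarity model instead.
SYNONYMS: Dict[FrozenSet[str], float] = {
    frozenset(("电脑", "计算机")): 0.9,
}


def word_sim(a: str, b: str) -> float:
    """Similarity between two words: 1.0 if identical, else a table lookup."""
    if a == b:
        return 1.0
    return SYNONYMS.get(frozenset((a, b)), 0.0)


def semantic_list(tokens: List[str]) -> List[str]:
    """Distinct words of a text, in order of first appearance."""
    return list(dict.fromkeys(tokens))


def list_sim(x: List[str], y: List[str]) -> float:
    """Assumed measure: symmetric average of best-match word similarities."""
    if not x or not y:
        return 0.0

    def directed(p: List[str], q: List[str]) -> float:
        return sum(max(word_sim(w, v) for v in q) for w in p) / len(p)

    return (directed(x, y) + directed(y, x)) / 2


def cluster(docs: List[List[str]], threshold: float = 0.5) -> List[dict]:
    """Assumed single-pass clustering; each cluster keeps a merged
    semantic list that doubles as its human-readable description."""
    clusters: List[dict] = []
    for i, tokens in enumerate(docs):
        sl = semantic_list(tokens)
        best, best_sim = None, threshold
        for c in clusters:
            s = list_sim(sl, c["description"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [i], "description": sl})
        else:
            best["members"].append(i)
            best["description"] = semantic_list(best["description"] + sl)
    return clusters


# Pre-segmented toy documents (word segmentation is assumed done upstream).
docs = [
    ["电脑", "软件", "算法"],
    ["计算机", "软件", "算法"],
    ["足球", "比赛"],
]
for c in cluster(docs):
    print(c["members"], c["description"])
```

In this sketch the first two documents share no vector-space dimension for 电脑 and 计算机, yet the word-level similarity lets them match, and each cluster's description is simply the list of words its members contain.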
