
A Method for Improving the Quality of Text Clustering Algorithms

ISSN: 0253-374X
Journal: Journal of Tongji University (Natural Science)
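Classification: TP312 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005
Funding: National Natural Science Foundation of China (50474033)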

Author: Feng Shaorong (冯少荣) [1]

Keywords: text clustering, semantic distance, nearest neighbor clustering, similarity, clustering algorithm

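Abstract (translated from the Chinese):

Text clustering algorithms based on the vector space model (VSM) have a central weakness: they ignore the semantic information between words and the relationships among dimensions, so text similarity is not computed accurately. To address this, the paper proposes computing inter-document similarity from semantic distance and using a two-stage clustering scheme to improve clustering quality. First, documents are analyzed semantically and clustered a first time with the nearest neighbor algorithm. Second, the feature words of each class are kept or discarded according to their similarity weights. Classes are then merged. Finally, a second clustering pass is performed to resolve the nearest neighbor algorithm's sensitivity to input order. Experimental results show that the proposed method significantly improves both clustering precision and recall and largely resolves the problems of VSM-based text clustering.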
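The VSM weakness described above can be illustrated with a small sketch. It is not taken from the paper: the bag-of-words weighting, the function name `cosine_similarity`, and the three toy documents are all assumptions chosen for illustration. Two documents that use synonyms but share no literal terms score zero, while two documents on different topics that share surface words score high.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two token lists under a raw term-count VSM."""
    vec_a, vec_b = Counter(doc_a), Counter(doc_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy documents (already tokenized).
doc1 = ["car", "engine", "repair"]
doc2 = ["automobile", "motor", "maintenance"]  # same topic, different words
doc3 = ["car", "engine", "price"]              # different intent, shared words

sim_synonyms = cosine_similarity(doc1, doc2)  # 0.0: no term overlap
sim_overlap = cosine_similarity(doc1, doc3)   # 2/3: two of three terms shared
```

A semantic distance between words, as proposed in the abstract, is meant to give the first pair a nonzero similarity; how the paper computes that distance is not stated in this record.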

English abstract:

The main problem with text clustering algorithms based on the vector space model (VSM) is that the semantic information between words and the links between dimensions are overlooked, which makes the text similarity calculation inaccurate. A method that computes text similarity from semantic distance and uses two-phase clustering is proposed to improve the text clustering algorithm. First, the text is analyzed semantically, and the nearest neighbor algorithm is used for the first clustering pass. Feature words are then selected according to their similarity weights to represent each cluster, so that the remaining feature words are close to the main themes of the cluster, and class combination is carried out. Finally, a second clustering pass is carried out to address the sensitivity of nearest neighbor clustering to the input order of the documents. Simulation experiments indicate that the proposed algorithm solves these problems and performs better than VSM-based text clustering in clustering precision and recall.
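The pipeline outlined in the abstract (first nearest neighbor pass, feature-word selection, class merging, second pass) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: Jaccard overlap stands in for the paper's semantic-distance similarity, "top_k most frequent words" stands in for selection by similarity weight, and all function names, thresholds, and sample documents are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Stand-in similarity of two word collections (NOT the paper's semantic distance)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def nearest_neighbor_pass(docs, threshold, sim=jaccard):
    """First pass: each document joins the cluster of its most similar
    earlier document, or opens a new cluster if none reaches the threshold."""
    labels = []
    for i, doc in enumerate(docs):
        best_j, best_sim = None, 0.0
        for j in range(i):
            s = sim(doc, docs[j])
            if s > best_sim:
                best_j, best_sim = j, s
        if best_j is not None and best_sim >= threshold:
            labels.append(labels[best_j])
        else:
            labels.append(max(labels, default=-1) + 1)
    return labels

def feature_words(docs, labels, top_k):
    """Keep the top_k words per cluster, ranked by document frequency
    (an assumed proxy for the paper's similarity weight)."""
    counts = {}
    for doc, lab in zip(docs, labels):
        counts.setdefault(lab, Counter()).update(sorted(set(doc)))
    return {lab: {w for w, _ in c.most_common(top_k)} for lab, c in counts.items()}

def merge_clusters(features, merge_threshold, sim=jaccard):
    """Merge clusters whose feature-word sets are similar enough."""
    merged = []
    for words in features.values():
        for group in merged:
            if sim(words, group) >= merge_threshold:
                group |= words
                break
        else:
            merged.append(set(words))
    return merged

def second_pass(docs, merged, sim=jaccard):
    """Second pass: reassign every document to the merged cluster whose
    feature words it matches best, instead of relying on arrival order."""
    return [max(range(len(merged)), key=lambda k: sim(doc, merged[k])) for doc in docs]

def two_phase_cluster(docs, threshold=0.3, merge_threshold=0.3, top_k=5):
    labels = nearest_neighbor_pass(docs, threshold)
    features = feature_words(docs, labels, top_k)
    merged = merge_clusters(features, merge_threshold)
    return second_pass(docs, merged)

# Hypothetical tokenized documents on two topics.
docs = [
    ["data", "mining", "cluster"],
    ["stock", "market", "price"],
    ["data", "cluster", "text"],
    ["market", "price", "trade"],
]
labels = two_phase_cluster(docs)  # [0, 1, 0, 1]
```

The first pass alone depends on the order in which documents arrive, because a document can only join clusters that already exist. Reassigning every document against the merged cluster descriptions in a second pass is one simple way to reduce that dependence.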


Journal information

Journal of Tongji University (Natural Science)
Peking University Core Journal (2011 edition)

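Supervising body: Ministry of Education
Sponsor: Tongji University
Editor-in-chief: Li Jie (李杰)
Address: 1239 Siping Road, Shanghai
Postal code: 200092
Email: zrxb@tongji.edu.cn
Phone: 021-65982344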

International standard serial number (ISSN): 0253-374X
Domestic unified serial number (CN): 31-1267/N
Postal distribution code: 4-260

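Awards:
National "Double Hundred" Journal; Key Science and Technology Journal Award of the Second National Periodical Awards; First Prize, 1999 National Outstanding University Natural Science Journals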

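Indexed in domestic and international databases:
Abstracts Journal (Russia); Chemical Abstracts, online edition (USA); Mathematical Reviews, online edition (USA); Zentralblatt MATH (Germany); abstract and citation database Scopus (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); China Science and Technology Core Journals; Peking University Core Journals (2000, 2004, 2008, 2011, and 2014 editions)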

Times cited: 34557