A Chinese Text Similarity Measure Combining Semantic Analysis and Term-Frequency Statistics

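ISSN: 1001-3695
Journal: 计算机应用研究 (Application Research of Computers)
Year: 2012
Pages: 833-836
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; [2] Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Suzhou, Jiangsu 215006
Funding: National Natural Science Foundation of China (60970056, 61070123, 61003155); State Key Laboratory of Pattern Recognition project fund; Natural Science Foundation of Jiangsu Province (BK2008160); Specialized Research Fund for the Doctoral Program of Higher Education (20093201110006)
Related project: Research on multi-document event information fusion methods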

Keywords: vector space model, semantic analysis, term frequency, probability distribution, text similarity

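Abstract (translated from the Chinese):

Most statistics-based text similarity measures first use TF-IDF to represent each text as a term-frequency vector and then compute the similarity between texts with the cosine measure. Because such methods ignore the semantic information of the terms in a text, they do not reflect the similarity between texts well. Semantics-based methods can largely remedy this shortcoming, but they need a knowledge base to build the semantic relations between words. This paper studies the strengths and weaknesses of these two classes of methods and proposes a novel text similarity measure. The method first preprocesses the text, then selects the terms with higher TF-IDF values as feature items, then uses the HowNet semantic dictionary together with TF-IDF to compute text similarity by combining semantic analysis of the feature items with term-frequency statistics, and finally applies the resulting similarity in clustering experiments on benchmark text datasets. The experimental results show that the F-measure obtained with the proposed method is clearly better than that obtained with TF-IDF alone or with word semantics alone, which demonstrates the effectiveness of the proposed similarity measure.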
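
The statistical baseline described in the abstract (TF-IDF vectors compared by cosine) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the particular TF and IDF formulas, and the toy English documents are all assumptions, and a real pipeline would first segment Chinese text into words.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus (invented for illustration).
docs = [
    ["car", "engine", "repair"],
    ["automobile", "motor", "repair"],
    ["stock", "market", "price"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # only "repair" is shared
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

The first two documents are about the same topic, yet only the literally shared term "repair" contributes to their score, so the similarity stays low. That is the weakness of purely statistical measures that the abstract points out.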

English abstract:

Most statistics-based text similarity measures use TF-IDF to model text documents as term-frequency vectors and then compute the similarity between documents with the cosine measure. Because such methods ignore the semantic information in the documents, the resulting similarity values are not accurate. Semantics-based methods make up for this drawback, but they need a knowledge base to construct the relationships between words. After studying the advantages and disadvantages of these two kinds of methods, this paper presents a novel text similarity measure, which first pre-processes the text, then chooses the terms with higher TF-IDF values as feature items, next uses a semantic dictionary together with TF-IDF to compute the text similarity, and finally uses several K-means clustering methods to evaluate the performance of the new similarity measure. Experimental results show that the method's F-measure is superior to that of the other methods, which indicates that the proposed method is effective.
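
The abstract does not give the formula that combines semantic analysis with TF-IDF weights, so the sketch below only illustrates the general idea under stated assumptions. Feature items are the top-k terms by weight, and each feature term is matched to its semantically closest term in the other document, weighted by its TF-IDF value. The `TOY_SIM` table is a hypothetical stand-in for a HowNet-based word similarity, and the weights and the max-match averaging scheme are invented for illustration.

```python
def top_features(vec, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vec.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])

# Hypothetical word-similarity table standing in for HowNet; values invented.
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("engine", "motor")): 0.8,
}

def word_sim(a, b):
    """Semantic similarity of two words in [0, 1]; identical words score 1."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)

def directed(u, v):
    """Weighted average, over terms in u, of the best semantic match in v."""
    total = sum(u.values())
    if total == 0 or not v:
        return 0.0
    return sum(w * max(word_sim(t, s) for s in v)
               for t, w in u.items()) / total

def semantic_similarity(u, v):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (directed(u, v) + directed(v, u))

# Illustrative TF-IDF-style weights (assumed, not computed from real data).
doc_a = {"car": 0.37, "engine": 0.37, "repair": 0.14, "the": 0.01}
doc_b = {"automobile": 0.37, "motor": 0.37, "repair": 0.14, "a": 0.01}
doc_c = {"stock": 0.37, "market": 0.37, "price": 0.37}

feat_a = top_features(doc_a, 3)
feat_b = top_features(doc_b, 3)
feat_c = top_features(doc_c, 3)
sim_ab = semantic_similarity(feat_a, feat_b)
sim_ac = semantic_similarity(feat_a, feat_c)
```

With these toy values the two car-related documents score high even though they share only one literal term, while the unrelated document scores zero. The resulting pairwise similarities could then be passed to a clustering step such as K-means and evaluated with the F-measure, as the abstract describes.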
