A Document Representation Model Based on Term Co-occurrence (基于词共现的文档表示模型)

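ISSN: 1003-0077
Journal: 中文信息学报 (Journal of Chinese Information Processing)
Date: 0
Pages: 51-57
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Management, Tianjin University, Tianjin 300072; [2] Network and Information Center, Tianjin University, Tianjin 300072
Funding: National Natural Science Foundation of China (70901054)
Related project: Research on uncertainty reasoning modeling and risk propagation in NIS security risk assessment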

Authors: Chang Peng (常鹏), Feng Nan (冯楠)

Keywords: document model, co-occurrence, document similarity, text mining

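Abstract (translated from the Chinese):

A document representation model is the foundation of automatic text processing and an effective means of converting unstructured text data into structured data. However, the widely used Vector Space Model (VSM) represents documents in terms of individual words. Because it ignores the associations between words, the accuracy of text mining is difficult to improve substantially. Building on term co-occurrence analysis, this paper discusses the latent connection between document topics and second-order relations among terms, and defines quantitative measures of term co-occurrence degree and of relevance to the document topic. It then uses an association rule algorithm to extract term co-occurrence combinations from the document set, proposes a topic-oriented document vector representation model based on these combinations (Co-occurrence Term based Vector Space Model, CTVSM), and defines a CTVSM-based document similarity. Experiments show that CTVSM accurately reflects the relatedness between documents and discriminates topics more strongly than the classical VSM.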

English abstract:

This paper presents a novel co-occurrence term based vector space model (CTVSM) for automatic document indexing, inspired by the Vector Space Model (VSM). In contrast to the traditional VSM, which represents a document as a bag of words regardless of the position of those words in the text, the proposed technique uses co-occurring terms instead of single terms. First, pairs of strongly co-occurring terms are extracted from the document set by association rules; then the similarity between documents is defined. The experiments indicate substantial and consistent improvements of CTVSM over the standard VSM.
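
The pipeline outlined in the abstract (extract co-occurring term pairs from the document set, represent each document over those pairs, then compare documents) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not give the formulas for co-occurrence degree or topic relevance, so the sketch assumes a plain document-frequency support threshold in place of full association rule mining, binary pair weights, and cosine similarity. The function names `frequent_pairs`, `pair_vector`, and `cosine`, the `min_support` parameter, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_pairs(docs, min_support):
    """Return term pairs that co-occur in at least min_support documents.

    Assumption: a simple support count stands in for association rule mining.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair at most once per document.
        counts.update(combinations(sorted(set(tokens)), 2))
    return sorted(pair for pair, n in counts.items() if n >= min_support)


def pair_vector(tokens, pairs):
    """Binary vector: 1 where both terms of a pair occur in the document."""
    present = set(tokens)
    return [1 if a in present and b in present else 0 for a, b in pairs]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical): two documents per topic.
docs = [
    ["network", "security", "risk", "assessment"],
    ["network", "security", "risk", "model"],
    ["text", "mining", "document", "model"],
    ["text", "mining", "document", "similarity"],
]
pairs = frequent_pairs(docs, min_support=2)
vectors = [pair_vector(d, pairs) for d in docs]

same_topic = cosine(vectors[0], vectors[1])
cross_topic = cosine(vectors[0], vectors[2])
print(pairs)
print(same_topic, cross_topic)
```

In this toy corpus the word "model" appears in one document of each topic, so a single-term VSM would give those two documents a nonzero similarity. No pair containing "model" reaches the support threshold, so the pair-based vectors of documents from different topics share no dimension.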
