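The document representation model is the foundation of automatic text processing and an effective means of converting unstructured text data into structured data. However, the widely used Vector Space Model (VSM) represents a document in terms of individual words and ignores the associations between words, which limits how far the accuracy of text mining can be improved. Building on word co-occurrence analysis, this paper discusses the latent connection between document topics and the second-order relations between words, and then defines quantitative measures of the degree of word co-occurrence and of its relevance to the document topic. Word co-occurrence combinations are extracted from the document collection with an association rule algorithm, and a topic-oriented document vector representation based on these combinations, the Co-occurrence Term based Vector Space Model (CTVSM), is proposed together with a CTVSM-based document similarity measure. Experiments show that CTVSM accurately reflects the relatedness between documents and discriminates topics better than the classic VSM.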
This paper presents a novel co-occurrence term based vector space model (CTVSM) for automatic document indexing, inspired by the Vector Space Model (VSM). In contrast to the traditional VSM, which represents a document as a bag of words regardless of the positions of those words in the text, the proposed technique uses co-occurring terms instead of single terms. First, pairs of clearly co-occurring terms are extracted from the document set using association rules; the similarity between documents is then defined in the resulting model. Experiments indicate substantial and consistent improvements of CTVSM over the standard VSM.
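
The pipeline outlined above (extract co-occurring term pairs, represent each document over those pairs, compare documents) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the formulas for the co-occurrence degree or the topic relevance, so the sketch assumes a plain document-level support threshold (the frequent 2-itemset step of association rule mining), binary pair weights, and cosine similarity. The function names, the `min_support` parameter, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def extract_cooccurrence_pairs(docs, min_support=0.5):
    """Return term pairs that co-occur in at least min_support of the documents.

    Assumption: a simple support threshold stands in for the paper's
    co-occurrence degree and topic-relevance measures.
    """
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered pair once per document.
        terms = sorted(set(tokens))
        pair_counts.update(combinations(terms, 2))
    n_docs = len(docs)
    return sorted(p for p, c in pair_counts.items() if c / n_docs >= min_support)


def to_pair_vector(tokens, pairs):
    """Binary vector: 1.0 if both terms of a pair occur in the document."""
    terms = set(tokens)
    return [1.0 if a in terms and b in terms else 0.0 for a, b in pairs]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["vector", "space", "model", "document"],
    ["vector", "space", "document", "similarity"],
    ["association", "rule", "mining"],
    ["association", "rule", "vector"],
]

pairs = extract_cooccurrence_pairs(docs, min_support=0.5)
vectors = [to_pair_vector(d, pairs) for d in docs]

print(pairs)
print(cosine(vectors[0], vectors[1]))  # documents sharing several term pairs
print(cosine(vectors[0], vectors[3]))  # documents sharing only a single term
```

In this toy corpus the first and fourth documents share the single term "vector" but no retained term pair, so they have no overlap in the pair-based space even though a single-term VSM would give them a nonzero similarity. This only illustrates the intuition behind using co-occurring terms; it says nothing about the experimental results reported in the paper.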