Chinese Text Representation for Formal Context Extraction

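ISSN: 1005-3751
Journal: Computer Technology and Development (计算机技术与发展)
Date: 0
Pages: 36-39
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116026
Funding: National Natural Science Foundation of China (60972090)
Related project: Research on imprecision models for Semantic Web ontologies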

Keywords: formal context, document representation, set of similar-word sets, vector space model

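Abstract (translated from the Chinese):

A formal context has to be extracted from real data sources. When the data source is unstructured Chinese text, a representation for that text must be chosen first. The mainstream approach to Chinese text representation is the vector space model (VSM) with words as feature terms; its main drawback is that it ignores the semantic relations between words in natural language and therefore cannot express the semantic information of a text. This paper discusses an improved method: HowNet is chosen as the knowledge base, and a set of similar-word sets replaces single feature words to build a concept vector space for Chinese text. From Chinese texts represented in this concept vector space, the required formal context can be extracted conveniently according to the user's specific requirements. The practical application of the improved method is illustrated with 214 Chinese texts on transportation.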

English abstract:

A formal context must be extracted from data sources. To extract a formal context from unstructured Chinese documents, one must first decide how to represent them. The dominant model of document representation, the Vector Space Model (VSM), uses a single word as the characteristic item. VSM neglects the lexical semantic relations between words and therefore cannot express the semantic information of documents. This paper discusses an improved method that takes HowNet as the knowledge base and establishes a concept vector space for Chinese documents by using a set of similar-word sets to replace the single characteristic item in VSM. Given Chinese documents represented in the concept vector space, it is convenient to extract a formal context that meets user demand. The application of the improved method is illustrated with 214 Chinese texts about transportation.
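
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The similar-word sets here are hand-written placeholders, not derived from HowNet; the tokens are English for readability, whereas Chinese text would first need word segmentation; and the `threshold` parameter is an assumed stand-in for the "user demand" that decides which concepts count as attributes of a document.

```python
from collections import Counter

# Hypothetical similar-word sets, keyed by a concept label.
# In the paper these would come from HowNet; here they are made up.
SIMILAR_WORD_SETS = {
    "vehicle": {"car", "automobile", "bus"},
    "road": {"road", "highway", "street"},
    "accident": {"accident", "crash", "collision"},
}


def concept_vector(tokens, word_sets=SIMILAR_WORD_SETS):
    """Count, per concept, how many tokens fall into its similar-word set."""
    counts = Counter()
    for token in tokens:
        for concept, words in word_sets.items():
            if token in words:
                counts[concept] += 1
    # Return every concept, including those with a zero count.
    return {concept: counts[concept] for concept in word_sets}


def formal_context(docs, threshold=1, word_sets=SIMILAR_WORD_SETS):
    """Build a binary object-attribute relation.

    A document (object) has a concept (attribute) when the concept's
    count in that document is at least `threshold` (an assumed rule).
    """
    return {
        name: {
            concept
            for concept, n in concept_vector(tokens, word_sets).items()
            if n >= threshold
        }
        for name, tokens in docs.items()
    }


# Toy, pre-tokenized documents (hypothetical data).
docs = {
    "d1": ["car", "crash", "highway", "automobile"],
    "d2": ["bus", "street"],
}
context = formal_context(docs)
```

In this sketch "car" and "automobile" both contribute to the same `vehicle` dimension, which a word-level VSM would treat as two unrelated features. Raising `threshold` yields a sparser formal context from the same concept vectors.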