A formal context must be extracted from an actual data source. When the source is unstructured Chinese text, a representation for that text has to be chosen first. The dominant approach to representing Chinese text is the Vector Space Model (VSM), which uses individual words as feature items. Its main drawback is that it ignores the semantic relations between words in natural language, so it cannot express the semantic information of a text. This paper discusses an improved method: Hownet is taken as the knowledge base, and sets of similar words replace the single feature words of VSM to build a concept vector space for Chinese text. From Chinese texts represented in the concept vector space, the formal context a user needs can be extracted conveniently according to that user's specific requirements. The application of the improved method is illustrated with 214 Chinese texts on transportation.
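The idea of replacing single feature words with sets of similar words can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's implementation: the `CONCEPTS` table is a hand-written, hypothetical stand-in for similar-word sets that would be derived from Hownet, the documents are assumed to be already segmented into words, and the `threshold` used to turn concept weights into a binary formal context is an assumed parameter.

```python
from collections import Counter

# Hypothetical similar-word sets. In the method described above these would be
# derived from Hownet; here they are written by hand for illustration only.
CONCEPTS = {
    "vehicle": {"汽车", "轿车", "车辆"},
    "road": {"公路", "道路", "高速公路"},
    "accident": {"事故", "车祸"},
}


def concept_vector(tokens, concepts=CONCEPTS):
    """Map a segmented document to concept counts instead of word counts."""
    # Invert the table so each word points to the concept whose set contains it.
    word_to_concept = {w: c for c, words in concepts.items() for w in words}
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    # Report every concept, including those with zero occurrences.
    return {c: counts.get(c, 0) for c in concepts}


def formal_context(docs, threshold=1, concepts=CONCEPTS):
    """Build a binary object-attribute relation from concept vectors.

    A document (object) has a concept (attribute) when the concept's count
    reaches the assumed threshold.
    """
    return {
        name: {c for c, w in concept_vector(tokens, concepts).items() if w >= threshold}
        for name, tokens in docs.items()
    }


# Two toy, pre-segmented documents (assumed input format).
docs = {
    "d1": ["轿车", "高速公路", "车祸"],
    "d2": ["汽车", "车辆", "道路"],
}

if __name__ == "__main__":
    for name, attrs in formal_context(docs).items():
        print(name, sorted(attrs))
```

In this sketch "轿车" and "汽车" fall into the same concept, so two documents that share no surface word can still share an attribute. Raising `threshold` is one possible way to adapt the extracted formal context to a user's requirements.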