To strengthen the semantic descriptiveness of terms in the Vector Space Model (VSM) and to overcome its assumption that semantic units are mutually independent, this paper proposes a phrase-based method for describing feature granularity. Starting from the text representation and the way feature terms are organized, the method recognizes basic phrases with syntactic rules, builds a relation tree that links features to the head verb, and replaces the words of the bag-of-words (BOW) representation with basic phrases. Experimental results show that representing text with basic phrases improves classification performance, strengthens the links between terms, overcomes the mutual independence of feature terms, and maintains good classification quality even when the number of features is small.
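The idea of replacing BOW words with basic phrases can be illustrated with a minimal sketch. It is not the paper's rule set: the single chunking rule, the tag names (`ADJ`, `NOUN`, `VERB`, `DET`), and the helper names `chunk_basic_phrases` and `phrase_features` are all assumptions made for illustration, and the relation tree around the head verb is not modeled here.

```python
from collections import Counter

# Hypothetical toy rule (not from the paper): a maximal run of adjective/noun
# tokens that ends in a noun forms one basic noun phrase.
NP_TAGS = {"ADJ", "NOUN"}


def _flush(run):
    """Turn a collected run of (word, tag) pairs into at most one phrase."""
    # Drop trailing adjectives so the phrase ends in a noun.
    while run and run[-1][1] != "NOUN":
        run = run[:-1]
    return [" ".join(word for word, _ in run)] if run else []


def chunk_basic_phrases(tagged):
    """Group POS-tagged tokens into basic phrases; verbs are kept as-is."""
    phrases, run = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            run.append((word, tag))
            continue
        phrases.extend(_flush(run))
        run = []
        if tag == "VERB":
            phrases.append(word)  # candidate head verb, kept as its own feature
        # Other tags (determiners, prepositions, ...) are dropped in this sketch.
    phrases.extend(_flush(run))
    return phrases


def phrase_features(tagged):
    """Phrase-level counterpart of a BOW count vector."""
    return Counter(chunk_basic_phrases(tagged))


# Usage: a hand-tagged sentence (the tags are supplied, not predicted).
sentence = [
    ("the", "DET"), ("vector", "NOUN"), ("space", "NOUN"), ("model", "NOUN"),
    ("represents", "VERB"), ("short", "ADJ"), ("text", "NOUN"),
]
print(chunk_basic_phrases(sentence))
# prints ['vector space model', 'represents', 'short text']
```

With plain BOW the same sentence would yield the unrelated features `vector`, `space`, and `model`; the phrase-level feature keeps them together as one term, which is the kind of link between terms the abstract refers to.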