Most existing studies treat feature term selection and weight calculation in the Vector Space Model (VSM) as separate steps. This obscures the semantic loss introduced by Chinese word segmentation and, with purely frequency-based weighting, lowers the discriminative power of the feature terms. To address this, a keyword extraction method based on statistics and rules is proposed. Basic phrases are extracted with phrase-level syntactic rules and replace the words of the bag-of-words model as feature terms. A joint weighting function then combines each term's frequency, position, distribution, and grammatical role, which improves term discrimination and reduces the semantic loss of segmented words. Experimental results show that, compared with existing methods, keywords extracted this way filter text information more effectively.
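The joint weighting idea can be illustrated with a minimal sketch. The abstract does not give the actual weighting function, so everything specific here is an assumption made for illustration: the linear form, the coefficients in `COEFFS`, the score tables `POSITION_SCORE` and `ROLE_SCORE`, and the names `Occurrence`, `joint_weights`, and `top_keywords`. The sketch also assumes that basic phrases have already been extracted and labeled by some upstream syntactic chunker, which is not shown.

```python
from collections import defaultdict
from typing import NamedTuple

# Hypothetical scores and mixing coefficients; the abstract gives no values.
POSITION_SCORE = {"title": 1.0, "lead": 0.7, "body": 0.4}
ROLE_SCORE = {"subject": 1.0, "object": 0.8, "other": 0.5}
COEFFS = {"freq": 0.4, "pos": 0.2, "dist": 0.2, "role": 0.2}


class Occurrence(NamedTuple):
    phrase: str     # basic phrase produced by an upstream chunker (assumed)
    position: str   # key of POSITION_SCORE
    paragraph: int  # index of the paragraph containing this occurrence
    role: str       # key of ROLE_SCORE


def joint_weights(occurrences, n_paragraphs):
    """Return {phrase: weight} as a linear mix of four normalized signals."""
    grouped = defaultdict(list)
    for occ in occurrences:
        grouped[occ.phrase].append(occ)
    if not grouped:
        return {}
    max_count = max(len(occs) for occs in grouped.values())

    weights = {}
    for phrase, occs in grouped.items():
        # Frequency relative to the most frequent phrase in the document.
        freq = len(occs) / max_count
        # Best position and best grammatical role over all occurrences.
        pos = max(POSITION_SCORE[o.position] for o in occs)
        role = max(ROLE_SCORE[o.role] for o in occs)
        # Distribution: share of paragraphs in which the phrase appears.
        dist = len({o.paragraph for o in occs}) / n_paragraphs
        weights[phrase] = (COEFFS["freq"] * freq + COEFFS["pos"] * pos
                           + COEFFS["dist"] * dist + COEFFS["role"] * role)
    return weights


def top_keywords(occurrences, n_paragraphs, k=3):
    """Return the k phrases with the highest joint weight."""
    weights = joint_weights(occurrences, n_paragraphs)
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

With these assumed settings, a phrase that appears in the title, recurs in every paragraph, and acts as a subject outranks a phrase of similar raw frequency that only appears in body text. That is the kind of discrimination a frequency-only weight cannot provide.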