Phrases carry more information than single words and better represent the main topic of a text; most of what are commonly called keywords are in fact phrases. The problem at present is that automatic keyphrase indexing lacks a unified set of rules to guide it. Taking advantage of the strengths of rough set theory in data generalization and knowledge reduction, this paper mines a manually labeled keyphrase corpus drawn from People's Daily and derives a number of construction rules for Chinese keyphrases. These rules can be used for automatic keyword extraction and can also guide manual keyword indexing. Experimental results show that the acquired rules substantially improve the performance of automatic keyword extraction.