Keyword extraction plays an important role in automatic summarization, information retrieval, text classification, and text clustering. In practice, a considerable share of what are usually called keywords are actually key phrases and out-of-vocabulary (unregistered) words, and extracting these is particularly difficult. This paper proposes treating keyword extraction as two separate problems, key single-word extraction and key word-string (phrase) extraction, and designs a Chinese keyword extraction algorithm based on separate models. Different features are designed for each of the two problems to improve extraction accuracy. Experiments show that the proposed algorithm outperforms traditional keyword extraction algorithms.
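The idea of handling the two problems with separate models can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the features or models, so the scoring rules below (frequency with a first-position bonus for single words, repeated adjacent n-grams weighted by length for word strings), the function names, and all parameters are assumptions chosen only to show the two-track structure. The input is assumed to be already segmented into tokens.

```python
from collections import Counter


def extract_key_words(tokens, stopwords, top_k=3):
    """Word track (hypothetical features): frequency, boosted for early first occurrence."""
    counts = Counter(t for t in tokens if t not in stopwords)
    first_pos = {}
    for i, t in enumerate(tokens):
        first_pos.setdefault(t, i)  # remember the first index of each token
    n = len(tokens)
    # Position bonus ranges from 2.0 (first token) down towards 1.0 (last token).
    scores = {w: c * (2.0 - first_pos[w] / n) for w, c in counts.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def extract_key_strings(tokens, stopwords, max_len=3, min_count=2, top_k=3):
    """String track (hypothetical features): repeated n-grams of adjacent tokens, n >= 2."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if any(t in stopwords for t in gram):
                continue  # a string containing a stopword is not a candidate
            counts[gram] += 1
    # Keep only strings that recur; weight longer strings more heavily.
    scores = {g: c * len(g) for g, c in counts.items() if c >= min_count}
    ranked = sorted(scores, key=lambda g: (-scores[g], g))[:top_k]
    # Chinese tokens are joined without spaces.
    return ["".join(g) for g in ranked]


def extract_keywords(tokens, stopwords=frozenset(), top_k=3):
    """Run both tracks independently and return their results side by side."""
    return {
        "words": extract_key_words(tokens, stopwords, top_k=top_k),
        "strings": extract_key_strings(tokens, stopwords, top_k=top_k),
    }
```

The point of the sketch is the structure: each track has its own candidate generation and its own scoring features, so a multi-token keyword such as a phrase or an unregistered word that the segmenter split apart can still be recovered by the string track, even when its individual tokens would rank poorly on their own.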