Research on a Chinese Keyword Extraction Algorithm Based on Separate Models

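ISSN: 1003-0077
Journal: 《中文信息学报》 (Journal of Chinese Information Processing)
Date: 0
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073
Funding: National Natural Science Foundation of China (60403050); Program for New Century Excellent Talents in University (NCET-06-0926)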

Keywords: computer application, Chinese information processing, keyword extraction, keyphrases, separate model, mutual information, word-sequence boundary parameter table

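Chinese abstract (translated):

Keyword extraction plays a very important role in automatic summarization, information retrieval, text classification, and text clustering. A considerable share of what are usually called keywords are in fact key phrases and out-of-vocabulary words, and extracting this share is very difficult. This paper proposes splitting keyword extraction into two problems, key single-word extraction and key word-sequence (keyphrase) extraction, and designs a Chinese keyword extraction algorithm based on separate models. The algorithm uses different features for each of the two problems in order to improve extraction accuracy. Experiments show that the separate-model algorithm performs better than traditional keyword extraction algorithms.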

English abstract:

Keyword extraction plays an important role in information retrieval, automatic summarization, text clustering, and text classification. A significant portion of the keywords usually extracted are actually key phrases or out-of-vocabulary words, which makes keyword extraction more difficult. This paper argues that keyword extraction can be treated as two problems: extracting key words and extracting key phrases. It proposes a keyword extraction algorithm based on separate models, with different features developed for each of the two problems, so as to improve the accuracy of keywords extracted from Chinese documents. Experimental results show that the proposed algorithm performs better than traditional keyword extraction algorithms.
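
The abstract does not say which features the key-phrase model uses, but the keyword list mentions mutual information. As an illustration only, the following is a minimal sketch of pointwise mutual information (PMI) between adjacent tokens, one common way to judge whether two neighbouring words belong to the same phrase. The function name `adjacent_pmi`, the toy token list, and the choice of PMI over adjacent pairs are all assumptions made here, not the paper's actual method. The input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def adjacent_pmi(tokens):
    """Return log2 PMI for every adjacent token pair in a segmented token list.

    Hypothetical helper for illustration; not the feature set from the paper.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # probability of the pair
        p_a = unigrams[a] / n_uni    # probability of the left word
        p_b = unigrams[b] / n_uni    # probability of the right word
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy input: assumed to be the output of some Chinese word segmenter.
tokens = ["关键词", "提取", "算法", "基于", "分离", "模型",
          "关键词", "提取", "效果", "更好"]
scores = adjacent_pmi(tokens)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On very small inputs, PMI tends to favour pairs of rare words, so a practical system would likely combine it with frequency thresholds or other features before treating a pair as a candidate key phrase.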
