An Automatic Document Keyphrase Extraction Method Based on the Word Co-occurrence Graph

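ISSN: 0469-5097
Journal: Journal of Nanjing University (Natural Sciences) (《南京大学学报：自然科学版》)
Date: 0
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230027; [2] Department of Computer Science, Anhui Normal University, Wuhu 241000
Funding: National Natural Science Foundation of China (70171052, 90104030); Natural Science Foundation of the Anhui Provincial Department of Education (2005kj009zd)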

Keywords: natural language processing, word co-occurrence graph, keyphrase, term frequency-inverse document frequency (TFIDF)

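Abstract (translated from the Chinese):

Keyphrase extraction is a foundational task in automatic text processing. Building on an in-depth study of existing keyphrase extraction methods, this paper proposes an automatic document keyphrase extraction method based on the word co-occurrence graph. Starting from word-frequency statistics, the method uses the topic information formed in the word co-occurrence graph, together with information on the connections between different topics, to extract keyphrases from a document automatically. It aims to find words that are not high-frequency yet contribute strongly to the topic. Experiments show that the keyphrases extracted by this method match the author's topics more accurately.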

English abstract:

Advances in high-volume storage media have led to an explosion in the amount of machine-readable text. Keyphrase extraction is one of the fundamental tasks of natural language processing. In this paper, a novel automatic text keyphrase extraction method based on word co-occurrence is put forward, building on a study of existing keyphrase extraction methods. The method starts from word-frequency statistics and uses the text subject information captured by the word co-occurrence graph, together with the linkage information between different text subjects. Our goal is to extract keyphrases whose content most accurately matches the specific and unique interests of the user. The algorithm extracts keyphrases that represent the main point asserted in a document, without relying on external devices such as natural language processing tools or a document corpus. It is based on segmenting a graph, which represents the co-occurrence between terms in a document, into clusters. Each cluster corresponds to a concept on which the author's idea is based, and the top-ranked terms, by a statistic based on the relationship between each term and these clusters, are selected as keyphrases. The experimental results show that the terms extracted in this way match the author's point quite accurately, even though the method does not use the average frequency of each term in a corpus; that is, the method is a content-sensitive, domain-independent indexing device. Its purpose is to find words that are infrequent yet contribute greatly to the text subject. The greatest benefit is the extraction of infrequent words that carry the main point of the document, i.e., the concepts or ideas presented by the author. This merit can lead to the satisfaction of search engine users with unique interests.
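
The pipeline outlined in the abstract (build a term co-occurrence graph, split it into clusters, then rank terms by how they relate to those clusters) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes sentence-level co-occurrence, uses connected components as a stand-in for the graph segmentation step, and scores a term simply by the number of clusters it co-occurs with. The function names `build_clusters` and `bridge_scores`, the parameters `top_n` and `min_cooc`, and the toy sentences are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_clusters(sentences, top_n=4, min_cooc=2):
    """Link the top_n most frequent terms that share at least min_cooc
    sentences, then return the connected components as clusters.
    (Assumption: connected components stand in for graph segmentation.)"""
    # Count each term once per sentence it appears in.
    freq = Counter(term for sent in sentences for term in set(sent))
    frequent = [term for term, _ in freq.most_common(top_n)]
    # Count sentence-level co-occurrence between frequent terms.
    cooc = Counter()
    for sent in sentences:
        present = sorted(set(sent) & set(frequent))
        cooc.update(combinations(present, 2))
    # Keep only edges that are supported by enough sentences.
    neighbours = {term: set() for term in frequent}
    for (a, b), count in cooc.items():
        if count >= min_cooc:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Depth-first search to collect connected components.
    clusters, seen = [], set()
    for start in frequent:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def bridge_scores(sentences, clusters):
    """Score each term outside the clusters by the number of distinct
    clusters it shares a sentence with (a hypothetical ranking statistic)."""
    clustered = set().union(*clusters)
    touched = {}
    for sent in sentences:
        terms = set(sent)
        hit = {i for i, cluster in enumerate(clusters) if terms & cluster}
        for term in terms - clustered:
            touched.setdefault(term, set()).update(hit)
    return {term: len(ids) for term, ids in touched.items()}


# Toy document: each inner list is one tokenized sentence.
sentences = [
    ["graph", "cluster", "edge"],
    ["graph", "cluster", "node"],
    ["corpus", "frequency", "count"],
    ["corpus", "frequency", "weight"],
    ["keyphrase", "graph", "corpus"],
]
clusters = build_clusters(sentences)
scores = bridge_scores(sentences, clusters)
best = max(scores, key=scores.get)
```

In this toy input, "keyphrase" occurs only once, yet it is the only term that shares a sentence with both clusters, so it receives the highest score. That mirrors the stated aim of surfacing infrequent words that connect the document's main concepts, although the paper's actual statistic may differ.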
