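Keyphrase extraction is a fundamental task in automatic text processing. Building on an in-depth study of existing keyphrase extraction methods, this paper proposes a method for automatically extracting document keyphrases based on a word co-occurrence graph. Starting from word frequency statistics, the method uses the topic information formed in the word co-occurrence graph, together with the linkage features between different topics, to extract the keyphrases of a document automatically. It aims to find words that are not high-frequency yet contribute strongly to the topic. Experiments show that the keyphrases extracted by this method match the author's topic more accurately.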
Advances in high-volume storage media have led to an explosion in the amount of machine-readable text. Keyphrase extraction is one of the fundamental tasks of natural language processing. In this paper, a novel method for automatic keyphrase extraction from text, based on word co-occurrence, is put forward on the basis of a study of existing keyphrase extraction methods. The method starts from word frequency statistics and uses the subject information captured in a word co-occurrence graph together with the linkage information between different subjects of the text. Our goal is to extract keyphrases whose content most accurately matches the specific and unique interests of the user. The algorithm extracts keyphrases that represent the asserted main point of a document, without relying on external devices such as natural language processing tools or a document corpus. It is based on the segmentation of a graph, representing the co-occurrence between terms in a document, into clusters. Each cluster corresponds to a concept on which the author's idea is based, and the terms ranked highest by a statistic based on each term's relationship to these clusters are selected as keyphrases. The experimental results show that the terms extracted in this way match the author's point quite accurately, even though the method does not use the average frequency of each term in a corpus; that is, the method is a content-sensitive, domain-independent indexing device. Its purpose is to find words that are not frequent but contribute greatly to the subject of the text. Its greatest benefit is the extraction of infrequent words that carry the effect of the document, i.e., the concepts or ideas presented by the author. This merit can lead to the satisfaction of search engine users with unique interests.
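The pipeline outlined in the abstract (frequency statistics, a co-occurrence graph, segmentation into clusters, and ranking of terms by their relationship to those clusters) can be illustrated with a minimal sketch. The abstract does not give the exact statistic or thresholds, so everything specific here is an assumption made for illustration: the function name `extract_keyphrases`, the parameters `n_frequent`, `min_cooc` and `n_keys`, the use of sentence-level co-occurrence, connected components as clusters, and the score `1 - prod(1 - based/touching)`. Tokenization and stop-word removal are assumed to happen upstream.

```python
from collections import Counter
from itertools import combinations


def extract_keyphrases(sentences, n_frequent=4, min_cooc=2, n_keys=5):
    """Rank terms by how strongly they link to clusters of a co-occurrence graph.

    `sentences` is a list of token lists (tokenization is assumed done upstream).
    Returns up to `n_keys` (term, score) pairs, highest score first.
    All thresholds and the scoring formula are illustrative assumptions.
    """
    sents = [set(s) for s in sentences]
    freq = Counter(t for s in sents for t in s)  # number of sentences per term

    # Graph nodes: the most frequent terms.
    nodes = {t for t, _ in freq.most_common(n_frequent)}

    # Edges: node pairs that share at least `min_cooc` sentences.
    cooc = Counter()
    for s in sents:
        cooc.update(combinations(sorted(s & nodes), 2))
    adj = {t: set() for t in nodes}
    for (a, b), count in cooc.items():
        if count >= min_cooc:
            adj[a].add(b)
            adj[b].add(a)

    # Clusters: connected components of the thresholded graph.
    clusters, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(comp)

    # Score: 1 - product over clusters of (1 - share of the cluster's
    # sentences in which the term co-occurs with another cluster member).
    scores = {}
    for term in freq:
        miss = 1.0
        for comp in clusters:
            others = comp - {term}
            touching = sum(1 for s in sents if s & comp)
            based = sum(1 for s in sents if term in s and s & others)
            miss *= 1.0 - based / touching
        scores[term] = 1.0 - miss

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n_keys]
```

Under these assumptions, a term that appears only once but in a sentence touching two clusters scores higher than an equally rare term that touches none, which is the behaviour the abstract attributes to the method for infrequent but topic-relevant words.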