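This paper proposes a new keyword extraction algorithm based on the properties of small-world networks. First, a document is represented as a word network using the construction method of the K-nearest-neighbor coupled graph. The change in a word's clustering coefficient and the change in the average shortest path length are then introduced to measure the importance of words, and the words of high importance are selected to form the candidate keyword set. Compound keywords are extracted using the positional relations among the words of the candidate keyword set and the part-of-speech collocation relations of Chinese. Experimental results show that the method is feasible and effective, and that compound keywords express meaning in a way that makes the text easier to understand than ordinary keywords do.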
In this paper, a new algorithm based on small-world network properties is proposed for extracting compound keywords from Chinese documents. Using the construction of a k-nearest-neighbor coupled graph, a Chinese document is first represented as a network in which each node represents a term and each edge represents the co-occurrence of two terms. Then two variables, the clustering coefficient increment and the average path length increment, are introduced to measure each term's importance and to generate the candidate keyword set. Based on the part-of-speech collocation of terms within a sentence and the adjacency of terms in the candidate set, related words in the candidate set are combined into compound keywords. The experimental results show that the algorithm is effective and accurate in comparison with manual keyword extraction from the same documents. Compound keywords represent the semantics of a document far more clearly than a set of single keywords, facilitating better comprehension of the document.
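The candidate-selection step can be illustrated with a minimal Python sketch. It is not the implementation used in the paper, and it rests on several assumptions that the abstract does not confirm: the document is already segmented into sentences of terms, the k-nearest-neighbor coupled construction is approximated by linking each term to the terms at most `k` positions away in the same sentence, each "increment" is taken as the change in the network-wide value when a term's node is removed, and terms are ranked by the sum of the absolute values of the two changes. All function names are hypothetical. The compound-keyword step, which relies on part-of-speech tags, is omitted.

```python
from collections import deque
from itertools import combinations


def build_word_graph(sentences, k=2):
    """Assumed construction: link each term to the terms within k positions."""
    graph = {}
    for tokens in sentences:
        for i, term in enumerate(tokens):
            graph.setdefault(term, set())
            for other in tokens[i + 1:i + 1 + k]:
                if other != term:
                    graph[term].add(other)
                    graph.setdefault(other, set()).add(term)
    return graph


def clustering_coefficient(graph):
    """Mean local clustering coefficient; nodes with < 2 neighbours count as 0."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        degree = len(nbrs)
        if degree < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (degree * (degree - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            cur = queue.popleft()
            for nxt in graph[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def remove_node(graph, node):
    """Return a copy of the graph without the node and its edges."""
    return {n: nbrs - {node} for n, nbrs in graph.items() if n != node}


def term_importance(graph):
    """Assumed measure: change in C and L when each term is removed."""
    base_c = clustering_coefficient(graph)
    base_l = average_path_length(graph)
    scores = {}
    for node in graph:
        reduced = remove_node(graph, node)
        scores[node] = (clustering_coefficient(reduced) - base_c,
                        average_path_length(reduced) - base_l)
    return scores


def candidate_keywords(graph, top_n=5):
    """Assumed ranking: sum of absolute changes, largest first."""
    scores = term_importance(graph)
    ranked = sorted(scores,
                    key=lambda t: abs(scores[t][0]) + abs(scores[t][1]),
                    reverse=True)
    return ranked[:top_n]
```

Under these assumptions, a call such as `candidate_keywords(build_word_graph(sentences, k=2), top_n=10)` returns the terms whose removal perturbs the network most. Recomputing both measures for every removed node is expensive for long documents, so the sketch is meant only to make the idea concrete.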