Keyphrase extraction plays an important role in text clustering, classification, and retrieval. This paper uses the classic TF-IDF algorithm to improve the quality of keyphrase extraction. A study of TF-IDF shows that it extracts keywords by combining information from the single document with information from the document collection. On this basis, the paper proposes a keyphrase extraction method that combines TF-IDF, TextRank, and statistical knowledge, together with a method for ranking candidate keyphrases by inverse document frequency. Building on TextRank, the method uses TF-IDF to bring collection-level information about each word into the computation of the weights between words, which yields a score for each word. Statistical knowledge is then used to filter candidate keyphrases from the phrases formed by the words selected in the previous step. Finally, the candidate keyphrases are ranked using the idea of inverse document frequency. Experiments show that, compared with the classic TextRank model, the proposed model improves precision by 2%, recall by 4.5%, and F-measure by 3.4%.
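The word-scoring and ranking steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact formulas, so several choices here are assumptions made only for illustration: the edge weight between two co-occurring words is taken to be the product of their TF-IDF scores, the IDF is smoothed, and a phrase's document frequency counts the documents that contain it as a contiguous token sequence. The function names (`tf_idf`, `weighted_textrank`, `rank_phrases_by_idf`) and parameters are hypothetical, and the statistical filtering of candidate phrases is omitted because the abstract gives no details about it.

```python
import math
from collections import Counter, defaultdict


def tf_idf(doc, corpus):
    """TF-IDF of each word in `doc`; `corpus` is a list of token lists."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for d in corpus if word in d)
        # Smoothed IDF (an assumption) so that no word receives zero weight.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(doc)) * idf
    return scores


def weighted_textrank(doc, corpus, window=2, damping=0.85, iterations=50):
    """TextRank over a co-occurrence graph with TF-IDF-based edge weights."""
    weights = tf_idf(doc, corpus)
    graph = defaultdict(dict)
    for i, w in enumerate(doc):
        for j in range(i + 1, min(i + window + 1, len(doc))):
            v = doc[j]
            if v == w:
                continue
            # Assumed edge weight: product of the two words' TF-IDF scores.
            edge = weights[w] * weights[v]
            graph[w][v] = edge
            graph[v][w] = edge
    graph = dict(graph)
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            rank = 0.0
            for v, edge in graph[w].items():
                # Share of v's total edge weight that flows to w.
                rank += edge / sum(graph[v].values()) * scores[v]
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


def contains_phrase(tokens, phrase):
    """True if `phrase` (a list of tokens) occurs contiguously in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def rank_phrases_by_idf(phrases, corpus):
    """Sort candidate phrases (tuples of tokens) by descending phrase IDF."""
    n_docs = len(corpus)

    def idf(phrase):
        df = sum(1 for d in corpus if contains_phrase(d, list(phrase)))
        return math.log((1 + n_docs) / (1 + df))

    return sorted(phrases, key=idf, reverse=True)
```

Calling `weighted_textrank(doc, corpus)` on a tokenized document returns a dictionary that maps each word to its score. Candidate phrases built from the top-scoring words could then be passed to `rank_phrases_by_idf`, which places phrases that are rare in the collection first.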