To address the low efficiency and poor accuracy of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm in keyword extraction, a text keyword extraction method based on word frequency statistics was proposed. Firstly, a formula for the number of words that share the same frequency in a text was derived from Zipf's law. Secondly, this formula was used to determine the proportion of words at each frequency, which showed that the vast majority of words in a text are low-frequency words. Finally, the word frequency statistics law was applied to keyword extraction, yielding a TF-IDF algorithm based on word frequency statistics. Simulation experiments were conducted on Chinese and English text data sets. The average relative error of the same-frequency word count formula did not exceed 0.05, and the maximum absolute error of the proportion of words at each frequency was 0.04. Compared with the traditional TF-IDF algorithm, the proposed algorithm achieved higher average precision, average recall and average F1-measure, and a lower average runtime. The results show that, in text keyword extraction, the TF-IDF algorithm based on word frequency statistics outperforms the traditional TF-IDF algorithm in precision, recall and F1-measure, and effectively reduces keyword extraction runtime.
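The quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the derived formula, so the sketch assumes the classic Zipf-based estimate that a share of about 1/(n(n+1)) of the distinct words occurs exactly n times; the authors' formula may differ. The function names and the `min_count` cutoff are hypothetical: the cutoff only shows one plausible way that knowing most words are low-frequency could shrink the candidate set before scoring, and it is not confirmed as the authors' method.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency n to the number of distinct words occurring exactly n times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def observed_proportions(tokens):
    """Share of the vocabulary (distinct words) found at each frequency n."""
    spectrum = frequency_spectrum(tokens)
    vocab_size = sum(spectrum.values())
    return {n: count / vocab_size for n, count in sorted(spectrum.items())}


def zipf_proportion(n):
    """Assumed classic estimate: under Zipf's law (rank * frequency = constant),
    about 1 / (n * (n + 1)) of the distinct words occur exactly n times."""
    return 1.0 / (n * (n + 1))


def tf_idf(docs, min_count=1):
    """Standard TF-IDF per document: tf = count / len(doc), idf = log(N / df).

    min_count is a hypothetical cutoff (not taken from the paper): words that
    occur fewer than min_count times in a document are not scored.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
            if count >= min_count
        })
    return scores
```

Under the assumed estimate, half of the distinct words occur once (n = 1), one sixth occur twice and one twelfth occur three times, which is consistent with the observation that most words in a text are low-frequency. Comparing `observed_proportions(tokens)` with `zipf_proportion(n)` on a real corpus is the kind of check the reported error figures summarize.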