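Text keyword extraction method based on word frequency statistics (基于词频统计的文本关键词提取方法)
  • ISSN: 1001-9081
  • Journal: Journal of Computer Applications (《计算机应用》)
  • Time: 0
  • Classification: TP391 [Automation and Computer Technology - Computer Application Technology; Automation and Computer Technology - Computer Science and Technology]
  • Author affiliations: [1] College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024; [2] Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang 050024; [3] Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang 050024
  • Funding: National Natural Science Foundation of China (71271067); National Social Science Fund of China (13BTY011); Major Program of the National Social Science Fund of China (13&ZD091); Science and Technology Research Project of Higher Education Institutions of Hebei Province (QN2014196); Master's Fund of Hebei Normal University (201402002)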
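Chinese abstract (translated):

To address the low efficiency and limited accuracy of the traditional TF-IDF algorithm in keyword extraction, a text keyword extraction method based on word frequency statistics is proposed. First, a formula for the number of words that share the same frequency in a text is derived from Zipf's law. Second, this formula is used to determine the proportion of words at each frequency, which shows that the vast majority of words in a text are low-frequency words. Finally, these word frequency regularities are applied to keyword extraction, yielding a TF-IDF algorithm based on word frequency statistics. Simulation experiments were carried out on Chinese and English text data sets. The average relative error of the derived same-frequency word count formula did not exceed 0.05, and the maximum absolute error of the established proportions of words at each frequency was 0.04. Compared with the traditional TF-IDF algorithm, the proposed algorithm achieved higher average precision, average recall, and average F1-measure, with a lower average runtime. The experimental results show that, in text keyword extraction, the TF-IDF algorithm based on word frequency statistics outperforms the traditional TF-IDF algorithm in precision, recall, and F1-measure, and effectively reduces the runtime of keyword extraction.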
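The abstract does not reproduce the same-frequency word count formula itself, so the following is only a minimal sketch of the classic idealized argument, not necessarily the formula derived in the paper. Assuming Zipf's law in the form f · r = C (frequency times rank is constant), the words occurring at least n times occupy ranks up to C/n, so roughly C/n − C/(n+1) = C/(n(n+1)) words occur exactly n times. Taking the vocabulary size as approximately C, the share of distinct words with frequency n becomes 1/(n(n+1)). The function names below are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def zipf_same_frequency_share(n: int) -> Fraction:
    """Idealized share of distinct words that occur exactly n times.

    Assumption (not taken from the paper): Zipf's law f * r = C with the
    vocabulary size approximated by C, which gives 1 / (n * (n + 1)).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return Fraction(1, n * (n + 1))


def observed_same_frequency_share(tokens):
    """Empirical share of distinct words at each frequency in a token list."""
    word_counts = Counter(tokens)                  # word -> frequency
    words_per_freq = Counter(word_counts.values())  # frequency -> number of words
    vocabulary_size = len(word_counts)
    return {n: k / vocabulary_size for n, k in sorted(words_per_freq.items())}


if __name__ == "__main__":
    # Under the idealized model, words seen one to three times already
    # account for 1/2 + 1/6 + 1/12 = 3/4 of the vocabulary.
    print(sum(zipf_same_frequency_share(n) for n in (1, 2, 3)))
```

Under this idealized model, words with frequency 1, 2, and 3 together make up three quarters of the vocabulary, which is consistent with the abstract's observation that most words in a text are low-frequency words. Comparing `observed_same_frequency_share` on a real corpus against the model is one way to measure the kind of error the abstract reports.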

English abstract:

To address the low efficiency and poor accuracy of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm in keyword extraction, a text keyword extraction method based on word frequency statistics was proposed. Firstly, a formula for the number of words with the same frequency in a text was deduced from Zipf's law. Secondly, the proportion of words at each frequency was determined from this formula, showing that most words in a text are low-frequency words. Finally, a TF-IDF algorithm based on word frequency statistics was proposed by applying these word frequency regularities to keyword extraction. Simulation experiments were conducted on Chinese and English text data sets. The average relative error of the same-frequency word count formula was not more than 0.05, and the maximum absolute error of the proportion of words at each frequency was 0.04. Compared with the traditional TF-IDF algorithm, the TF-IDF algorithm based on word frequency statistics increased the average precision, average recall, and average F1-measure, while decreasing the average runtime. The simulation results show that, in text keyword extraction, the TF-IDF algorithm based on word frequency statistics is superior to the traditional TF-IDF algorithm in precision, recall, and F1-measure, and that it effectively reduces the runtime of keyword extraction.
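For orientation, the baseline that the abstract compares against can be sketched as a plain TF-IDF keyword ranker. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `tfidf_keywords`, the unsmoothed IDF variant log(N / df), and in particular the `min_tf` cutoff are hypothetical. The cutoff only illustrates one way that knowing most words are low-frequency could shrink the candidate set; the abstract does not say how the paper actually uses the frequency statistics.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3, min_tf=1):
    """Rank the terms of docs[doc_index] by TF-IDF and return the top_k.

    docs is a list of token lists. min_tf is a hypothetical cutoff that
    skips terms occurring fewer than min_tf times in the target document;
    it is an illustrative assumption, not the paper's rule.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    target = docs[doc_index]
    term_counts = Counter(target)

    scores = {}
    for term, count in term_counts.items():
        if count < min_tf:
            continue  # candidate pruned by the assumed frequency cutoff
        tf = count / len(target)
        idf = math.log(n_docs / doc_freq[term])  # unsmoothed IDF variant
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


if __name__ == "__main__":
    corpus = [
        ["zipf", "law", "word", "zipf", "rank"],
        ["word", "law", "text"],
        ["text", "word", "mining"],
    ]
    print(tfidf_keywords(corpus, 0, top_k=2))            # baseline ranking
    print(tfidf_keywords(corpus, 0, top_k=2, min_tf=2))  # with the assumed cutoff
```

With `min_tf=1` the function is ordinary TF-IDF. Raising the cutoff reduces the number of terms that need scoring, which is the kind of runtime saving the abstract reports, though whether it preserves precision and recall depends on the corpus.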

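Journal information
  • Journal of Computer Applications (《计算机应用》)
  • Peking University Core Journal (2011 edition)
  • Supervising body: Sichuan Association for Science and Technology
  • Sponsors: Sichuan Computer Federation; Chengdu Branch, Chinese Academy of Sciences
  • Editor-in-chief: Zhang Jingzhong
  • Address: Computing Institute, Chengdu Branch of the Chinese Academy of Sciences, No. 9, Section 4, Renmin South Road, Chengdu
  • Postal code: 610041
  • Email: xzh@joca.cn
  • Telephone: 028-85224283
  • International standard serial number: ISSN 1001-9081
  • Domestic unified serial number: CN 51-1307/TP
  • Postal distribution code: 62-110
  • Awards:
  • First Prize, National Excellent Science and Technology Journal; National Periodical Award nominee; Double Award Journal of the China Periodical Matrix; Chinese Core Journal; Chinese Science and Technology Core Journal
  • Indexed in (domestic and international databases):
  • Abstracts Journal (Russia); Index Copernicus (Poland); Cambridge Scientific Abstracts (USA); Science Abstracts database (UK); Japan Science and Technology Agency database (Japan); Chinese Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)
  • Citations: 53679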