Mining Patent Knowledge for Automatic Keyword Extraction

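ISSN: 1000-1239
Journal: Journal of Computer Research and Development (计算机研究与发展)
Year: 2016
Pages: 1740-1752
Issue: 08
Note: 11-1777/TP
Classification: TP391 [Automation and Computer Technology - Computer Application Technology; Automation and Computer Technology - Computer Science and Technology]
Author addresses: Department of Computer Science, Sun Yat-sen University; Department of Computer Science, Guangdong University of Education; College of Information Science and Technology, Jinan University; Zhuhai Meizu Technology Co., Ltd.
Author affiliations: [1] Department of Computer Science, Sun Yat-sen University, Guangzhou 510275; [2] Department of Computer Science, Guangdong University of Education, Guangzhou 510303; [3] College of Information Science and Technology, Jinan University, Guangzhou 510632; [4] Zhuhai Meizu Technology Co., Ltd., Zhuhai, Guangdong 519085
Funding: National Natural Science Foundation of China (61472453, U1401256, U1501252); Guangdong Province Science and Technology Planning Project (2012A010701013)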

Keywords: background knowledge, keyword extraction, patent data, support vector machine (SVM), information retrieval

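Abstract (translated from the Chinese):

Keywords are an important clue that lets people quickly decide whether a document is worth reading in detail, and automatic keyword extraction has important applications in information retrieval, natural language processing, and related research areas. This paper designs a new automatic keyword extraction method that enables a computer, like a human expert, to use a knowledge base to learn and understand the target text and finally extract its keywords automatically. Patent data is chosen as the source of the knowledge base because of its large volume, rich content, accurate expression, and professional authority. The paper discusses the characteristics of patent data in detail, mines the knowledge associations between different patents, constructs a background knowledge base for a given knowledge domain, and on this basis performs automatic keyword extraction for the target text. For each patent in the collection related to the target text, the inventors, assignees, patent citations, and classification information are all used to discover associations between different patent documents. This association information is used to expand the background knowledge base, yielding a background knowledge base for the target document in each related knowledge domain. Based on the background knowledge base, a word knowledge feature value is designed to reflect how important a word is within the background knowledge of the target text. Finally, the keyword extraction problem is converted into a classification problem, and a support vector machine (SVM) is used to extract the keywords of the target text. Experimental results on a patent data set and an open data set show that the method clearly outperforms existing algorithms.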
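
The repository-expansion step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual algorithm or the definition of its word knowledge feature, so everything here is an assumption: the `PATENTS` records and their field names are hypothetical, the expansion is a simple one-hop link over shared inventors, assignees, classes, or citations, and `knowledge_feature` is a stand-in (the fraction of background patents that contain the word).

```python
# Hypothetical patent records; field names and values are illustrative only.
PATENTS = {
    "P1": {"inventors": {"Li"}, "assignees": {"AcmeCo"}, "cites": {"P3"},
           "classes": {"G06F17"}, "text": "keyword extraction from text documents"},
    "P2": {"inventors": {"Li"}, "assignees": {"BetaCo"}, "cites": set(),
           "classes": {"H04L29"}, "text": "network packet routing method"},
    "P3": {"inventors": {"Wang"}, "assignees": {"GammaCo"}, "cites": set(),
           "classes": {"G06N20"}, "text": "support vector machine text classification"},
    "P4": {"inventors": {"Zhao"}, "assignees": {"DeltaCo"}, "cites": set(),
           "classes": {"A01B1"}, "text": "soil tillage tool"},
}


def related(a, b):
    """Assumed link rule: shared inventor, assignee, or class, or a citation either way."""
    pa, pb = PATENTS[a], PATENTS[b]
    return bool(
        pa["inventors"] & pb["inventors"]
        or pa["assignees"] & pb["assignees"]
        or pa["classes"] & pb["classes"]
        or b in pa["cites"]
        or a in pb["cites"]
    )


def expand(seed_ids):
    """One-hop expansion: add every patent linked to at least one seed patent."""
    expanded = set(seed_ids)
    for seed in seed_ids:
        for other in PATENTS:
            if other not in expanded and related(seed, other):
                expanded.add(other)
    return expanded


def knowledge_feature(word, patent_ids):
    """Stand-in feature: share of background patents whose text contains the word."""
    hits = sum(1 for pid in patent_ids if word in PATENTS[pid]["text"].split())
    return hits / len(patent_ids)


# Patents retrieved for the target document (here just P1) seed the repository.
background = expand({"P1"})
```

In this toy data, `P2` joins the background set through a shared inventor and `P3` through a citation, while `P4` has no link to the seed and stays out. A word that recurs across the expanded set then scores higher than one that is absent from it.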

English abstract:

Keywords are important clues that help a user quickly decide whether to skip, scan, or read an article. Keyword extraction plays an increasingly crucial role in information retrieval, natural language processing, and several other text-related research areas. This paper addresses the problem of automatic keyword extraction and designs a novel approach that makes use of patent knowledge. The approach helps a computer learn and understand a document as a human being would, according to its background knowledge, and finally pick out keywords automatically. The patent data set is chosen as the external knowledge repository because of its huge amount of data, rich content, accurate expression, and professional authority. This paper uses the patent data set as an external knowledge repository for keyword extraction. An algorithm is designed to construct the background knowledge repository from the patent data set, and a method for automatic keyword extraction with novel word features is provided. This paper discusses the characteristics of patent data, mines the relations between different patent files to construct a background knowledge repository for the target document, and finally achieves keyword extraction. The patent files related to the target document are used to construct the background knowledge repository. Information on patent inventors, assignees, citations, and classification is used to mine the hidden knowledge and relationships between different patent files, and the related knowledge is imported to extend the background knowledge repository. Novel word features are derived from the different background knowledge supplied by the patent data. These word features, which reflect the document's background knowledge, offer valuable indications of the importance of individual words in the target document. The keyword extraction problem can then be regarded as a classification problem, and a support vector machine (SVM) is used to extract the keywords. Experiments have been conducted on a patent data set and an open data set. Experime...
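
Casting keyword extraction as classification can be sketched as follows. This is not the authors' implementation: it is a plain-Python linear SVM trained by per-sample subgradient descent on the hinge loss, swapped in so the example has no dependencies. The two features per candidate word (a term-frequency score and a background-knowledge score), the toy training data, and the hyperparameters are all assumptions made for illustration.

```python
# Toy candidate words described by two hypothetical features:
# (term-frequency score, background-knowledge score).
# Label +1 means "keyword", -1 means "not a keyword".
TRAIN = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.2, 0.1), -1), ((0.1, 0.3), -1), ((0.3, 0.2), -1),
]


def train_linear_svm(samples, epochs=200, lr=0.1, lam=0.01):
    """Minimise lam/2 * ||w||^2 + hinge loss with per-sample subgradient steps."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample violates the margin: step toward classifying it correctly.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regulariser shrinks the weights.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def is_keyword(model, x):
    """Classify a candidate word by the sign of the linear decision function."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


MODEL = train_linear_svm(TRAIN)
```

Each candidate word in a target document would be scored with `is_keyword(MODEL, features)`, and the words classified as positive form the extracted keyword set. A library SVM with a non-linear kernel could replace the hand-written trainer without changing this overall shape.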
