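Patent keywords are a highly condensed summary of a patent document, and extracting them correctly is important for the classification, indexing, and clustering of patent documents. Taking the characteristics of patent documents into account and building on existing methods, this paper proposes a method for extracting domain common words from patent documents, a morpheme weighting method, and a parallel-structure penalty method, and applies them to keyword extraction from patent documents. After common words are filtered out, the position at which a word appears in the document, its frequency, its morphemes, and parallel structures are combined to compute the word's influence on the document's topic, and the keywords of the patent document are extracted accordingly. Experimental results show that, when 5 to 9 keywords are extracted, the proposed method outperforms the locally weighted TF-IDF method, which verifies its effectiveness.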
Patent keywords are a high-level summary of a patent document, and extracting them correctly is important for patent document classification, indexing, clustering, and related tasks. This paper proposes a common-word extraction algorithm. After common words are removed, position weighting within the document, lexical-unit weighting, and a penalty function for parallel structures are combined to extract keywords from patent documents. Experimental results show that, when the number of extracted keywords ranges from 5 to 9, the presented method performs better than the baseline method, which demonstrates the feasibility of the proposed method.
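The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so the function names (`find_common_words`, `extract_keywords`), the document-ratio threshold, the position weights, the morpheme bonus, and the multiplicative penalty factor are all assumptions chosen only to show how the four signals might be combined. Tokenization and the detection of parallel structures are assumed to happen elsewhere.

```python
from collections import Counter


def find_common_words(domain_docs, min_doc_ratio=0.8):
    """Hypothetical common-word filter: words that occur in at least
    min_doc_ratio of the tokenized documents of one domain."""
    doc_freq = Counter()
    for tokens in domain_docs:
        doc_freq.update(set(tokens))
    total = len(domain_docs)
    return {w for w, c in doc_freq.items() if c / total >= min_doc_ratio}


def extract_keywords(sections, common, position_weights,
                     parallel_words=frozenset(), parallel_penalty=0.5,
                     morpheme_alpha=0.5, split_morphemes=list, top_k=5):
    """Rank candidate words by an assumed score:
    position-weighted frequency * morpheme bonus * parallel-structure penalty."""
    # Position-weighted frequency: each occurrence counts with the weight
    # of the section it appears in; common words are skipped.
    weighted_tf = Counter()
    for name, tokens in sections.items():
        weight = position_weights.get(name, 1.0)
        for tok in tokens:
            if tok not in common:
                weighted_tf[tok] += weight

    # Morpheme statistics: in how many candidate words each morpheme occurs.
    morpheme_df = Counter()
    for word in weighted_tf:
        morpheme_df.update(set(split_morphemes(word)))

    scores = {}
    for word, tf in weighted_tf.items():
        # Number of *other* candidates sharing this word's most common morpheme.
        shared = max((morpheme_df[m] for m in split_morphemes(word)), default=1) - 1
        score = tf * (1.0 + morpheme_alpha * shared)
        if word in parallel_words:
            # Words listed in an enumeration are down-weighted.
            score *= parallel_penalty
        scores[word] = score

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]


# Toy usage with made-up data; morphemes are split on "-" for readability.
domain_docs = [["device", "method", "fuel-cell"],
               ["device", "method", "valve"],
               ["device", "method", "pump"]]
common = find_common_words(domain_docs)
sections = {"title": ["fuel-cell", "device"],
            "abstract": ["fuel-cell", "fuel-tank", "copper", "zinc", "method"]}
print(extract_keywords(sections, common,
                       position_weights={"title": 3.0, "abstract": 1.0},
                       parallel_words={"copper", "zinc"},
                       split_morphemes=lambda w: w.split("-"),
                       top_k=3))
```

The default `split_morphemes=list` treats each character as a morpheme, which is a rough fit for Chinese text; any other splitter can be passed in. The multiplicative combination is only one possible design, and an additive or learned weighting would fit the same interface.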