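This paper studies methods for automatically extracting Chinese hyponyms and proposes a hyponym acquisition method based on dictionary information and online encyclopedias, with the aim of building a reasonably complete knowledge base of hypernym-hyponym word pairs. The dictionary-based method uses the formatted information contained in the Chinese Concept Dictionary (《中文概念词典》) and the Chinese Classified Subject Thesaurus (《中国分类主题词表》) to obtain hypernym-hyponym relations. The encyclopedia-based method draws on Wikipedia, Baidu Baike, and Hudong Baike: it analyzes the URLs and content formats of encyclopedia pages and extracts hyponyms with regular expressions. The extracted hyponyms are filtered automatically and then proofread manually. Experiments show that, compared with the results of the NLP&CC 2012 hypernym-hyponym relation evaluation, the proposed method achieves good results.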
Hyponymy, a basic semantic relation between words, is widely used in areas such as text classification and information retrieval. Automatic extraction of this relation is an important problem in natural language processing. Two hyponymy extraction strategies, i.e., a dictionary-based strategy and an encyclopedia-based strategy, are proposed to build a comprehensive hyponymy knowledge base. The Chinese Concept Dictionary and the Chinese Classified Subject Thesaurus are used as dictionary resources. Hand-written regular expressions are used to extract hyponyms from Wikipedia, Baidu Baike, and Hudong Baike, based on the addresses and content formats of their web pages. Experimental evaluation shows that the proposed strategies outperform the NLP&CC 2012 evaluation results.
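The encyclopedia-based strategy can be illustrated with a minimal sketch. It assumes a category-style page whose entries appear as list items linking to /wiki/ addresses. The pattern `LINK_RE`, the helpers `extract_hyponyms` and `filter_pairs`, and the sample markup are all hypothetical and chosen for illustration; they are not the expressions or filtering rules used in the paper.

```python
import re

# Hypothetical pattern (an assumption, not the paper's regex): a list item
# whose link points to an ordinary /wiki/ entry. Excluding ':' from the
# address skips namespace pages such as "Category:...".
LINK_RE = re.compile(r'<li><a href="/wiki/[^"#:]+"[^>]*>([^<]+)</a></li>')

# Assumed filter: keep candidates made only of CJK ideographs.
CJK_ONLY = re.compile(r'^[\u4e00-\u9fff]+$')


def extract_hyponyms(hypernym, html):
    """Return (hypernym, candidate) pairs for every entry link in the page."""
    pairs = []
    seen = set()
    for text in LINK_RE.findall(html):
        word = text.strip()
        # Skip empty strings, the hypernym itself, and duplicates.
        if not word or word == hypernym or word in seen:
            continue
        seen.add(word)
        pairs.append((hypernym, word))
    return pairs


def filter_pairs(pairs, max_len=10):
    """Automatic filtering step: drop over-long or non-Chinese candidates."""
    return [(h, w) for h, w in pairs if len(w) <= max_len and CJK_ONLY.match(w)]


# Made-up markup standing in for a category page about 水果 (fruit).
SAMPLE_HTML = '''
<ul>
<li><a href="/wiki/%E8%8B%B9%E6%9E%9C" title="苹果">苹果</a></li>
<li><a href="/wiki/%E9%A6%99%E8%95%89" title="香蕉">香蕉</a></li>
<li><a href="/wiki/Category:%E6%B0%B4%E6%9E%9C" title="水果">水果</a></li>
<li><a href="/wiki/List_of_fruits" title="List">List of fruits</a></li>
</ul>
'''

if __name__ == "__main__":
    candidates = extract_hyponyms("水果", SAMPLE_HTML)
    for hypernym, hyponym in filter_pairs(candidates):
        print(hypernym, "->", hyponym)
```

In this sketch the address check discards the namespace link, and the character filter discards the non-Chinese entry; any pairs that survive would still need the manual proofreading step.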