Recognizing domain concepts is an essential step in building a domain knowledge base. Current statistics-based methods recognize domain concepts by word frequency alone, yet some longer, important domain concepts are precisely the low-frequency terms, so these methods recognize low-frequency domain concepts with poor accuracy. To improve the recognition of low-frequency domain concepts, this paper proposes a weighted HITS algorithm based on word vectors. The method first represents candidate domain concepts from open text as word vectors, then computes their domain relevance with the weighted HITS algorithm, and finally selects the concepts whose domain relevance exceeds a given threshold to populate the domain knowledge base. Experiments show that, compared with existing methods, the proposed method achieves some improvement in both precision and recall of domain concept recognition. In particular, it improves the recall of low-frequency domain concepts by 10%.
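The three steps named in the abstract (word vectors, weighted HITS scoring, thresholding) can be illustrated with a minimal Python sketch. The abstract does not say how the graph is built or how edges are weighted, so everything specific here is an assumption made for illustration: edges are weighted by the non-negative cosine similarity between word vectors, scores are L2-normalized after each iteration, and the names `cosine`, `weighted_hits`, `select_domain_concepts`, the toy vectors, and the threshold value are all hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def _normalize(scores):
    """Scale a score list to unit L2 norm; leave an all-zero list unchanged."""
    norm = math.sqrt(sum(s * s for s in scores)) or 1.0
    return [s / norm for s in scores]


def weighted_hits(weights, iterations=50):
    """Run HITS on a weighted adjacency matrix.

    weights[i][j] is the weight of the edge i -> j.
    Returns (hub_scores, authority_scores).
    """
    n = len(weights)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority of j: weighted sum of the hub scores of nodes pointing to j.
        auths = _normalize(
            [sum(weights[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        # Hub of i: weighted sum of the authority scores of nodes i points to.
        hubs = _normalize(
            [sum(weights[i][j] * auths[j] for j in range(n)) for i in range(n)]
        )
    return hubs, auths


def select_domain_concepts(vectors, threshold):
    """Keep the terms whose authority score reaches the threshold.

    Assumption: edge weight = cosine similarity clipped at 0, no self-loops.
    """
    terms = list(vectors)
    weights = [
        [0.0 if s == t else max(cosine(vectors[s], vectors[t]), 0.0) for t in terms]
        for s in terms
    ]
    _, auths = weighted_hits(weights)
    return {t: a for t, a in zip(terms, auths) if a >= threshold}


# Toy, hand-made 3-dimensional vectors; real word vectors would come from a
# trained embedding model and have far more dimensions.
toy_vectors = {
    "neural network": [0.9, 0.1, 0.0],
    "back propagation": [0.8, 0.2, 0.1],
    "gradient descent": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}
selected = select_domain_concepts(toy_vectors, threshold=0.3)
print(sorted(selected))
```

Under these assumptions the score of a term depends on how strongly it is connected to other candidate terms in vector space, not on how often it occurs, which is the property the abstract relies on for low-frequency concepts. The threshold of 0.3 is arbitrary and only suits this toy data.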