This paper proposes a method for automatically acquiring domain concepts based on N-gram composite word segmentation. Building on standard Chinese word segmentation, the method composes adjacent segmented words into N-gram units and applies a series of filtering rules to extract candidate domain concepts. An improved TF-IDF measure then serves as the statistical feature for scoring the domain relevance of each candidate. Finally, the candidates are reviewed and screened with manual assistance. The method was tested on a corpus from the aero-engine domain, and the experimental results show that it can effectively extract concepts from a specialized domain and is practical to apply.
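The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the filtering rules or how TF-IDF was improved, so the sketch assumes a single hypothetical rule (drop candidates that begin or end with a stopword) and uses plain TF-IDF computed against a background corpus as a stand-in for the improved measure. The function names `compose_ngrams`, `keep`, and `domain_scores` are illustrative, and the input is assumed to be documents that have already been segmented into words.

```python
import math
from collections import Counter


def compose_ngrams(tokens, max_n=3):
    """Join 1..max_n adjacent segmented words into candidate terms (as tuples)."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(tuple(tokens[i:i + n]))
    return candidates


def keep(candidate, stopwords):
    """Hypothetical filter rule: reject candidates that start or end with a stopword."""
    return candidate[0] not in stopwords and candidate[-1] not in stopwords


def domain_scores(domain_docs, background_docs, stopwords, max_n=3):
    """Score candidates with plain TF-IDF (an assumed stand-in for the improved measure).

    tf  = relative frequency of the candidate in the domain corpus
    idf = log(N / df), with df counted over domain + background documents
    """
    all_docs = domain_docs + background_docs
    doc_sets = [set(compose_ngrams(doc, max_n)) for doc in all_docs]

    tf = Counter()
    for doc in domain_docs:
        tf.update(c for c in compose_ngrams(doc, max_n) if keep(c, stopwords))

    total = sum(tf.values())
    n_docs = len(all_docs)
    scores = {}
    for cand, freq in tf.items():
        df = sum(1 for s in doc_sets if cand in s)  # df >= 1 by construction
        scores[cand] = (freq / total) * math.log(n_docs / df)
    return scores


if __name__ == "__main__":
    # Toy, pre-segmented documents; English words stand in for segmented Chinese words.
    domain = [["turbine", "blade", "of", "engine"], ["turbine", "blade", "cooling"]]
    background = [["the", "history", "of", "engine"], ["of", "the", "city"]]
    ranked = sorted(domain_scores(domain, background, {"of", "the"}).items(),
                    key=lambda kv: kv[1], reverse=True)
    for cand, score in ranked[:5]:
        print(" ".join(cand), round(score, 4))
```

The highest-ranked candidates would then go to the manual review step; in practice the threshold and the filter rules would need tuning for the target domain.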