Existing word segmentation systems do not incorporate new words promptly and therefore fail to recognize domain-specific compound words. To address this problem, this paper proposes a compound word extraction method that combines location tags with part-of-speech (POS) information. First, the method builds a location tag set for each term by preprocessing the corpus, attaching location tags, and filtering terms by weighted term frequency. Next, it uses the location tag set to compute the adjacency degree of terms within sentences and identifies compound words accordingly. Finally, it formulates reverse rules to filter the extraction results, then progressively trims terms from both ends of each garbage string and re-evaluates the remainder to recover further compound words. Experiments on several corpora show that the method achieves higher precision.
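The location-tag and adjacency steps can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not define the adjacency degree, the frequency weighting, or the reverse rules, so the sketch makes these assumptions: a location tag is a (sentence index, token index) pair, the frequency filter is a plain occurrence count, and the adjacency degree is the share of occurrences in which two terms appear side by side. All function names and the parameters `min_count` and `threshold` are hypothetical. POS information, reverse rules, and garbage-string trimming are omitted.

```python
from collections import defaultdict


def build_position_tags(sentences):
    """Map each term to the set of (sentence_index, token_index) positions where it occurs."""
    tags = defaultdict(set)
    for s_idx, tokens in enumerate(sentences):
        for t_idx, tok in enumerate(tokens):
            tags[tok].add((s_idx, t_idx))
    return tags


def filter_by_frequency(tags, min_count=2):
    """Keep terms seen at least min_count times (assumed stand-in for weighted term frequency)."""
    return {tok: pos for tok, pos in tags.items() if len(pos) >= min_count}


def adjacency_degree(tags, left, right):
    """Assumed measure: adjacent co-occurrences divided by the count of the rarer term."""
    left_pos = tags.get(left, set())
    right_pos = tags.get(right, set())
    if not left_pos or not right_pos:
        return 0.0
    # Count positions of `left` that are immediately followed by `right`.
    adjacent = sum(1 for (s, t) in left_pos if (s, t + 1) in right_pos)
    return adjacent / min(len(left_pos), len(right_pos))


def extract_compounds(sentences, min_count=2, threshold=0.8):
    """Return adjacent term pairs whose adjacency degree reaches the threshold."""
    tags = filter_by_frequency(build_position_tags(sentences), min_count)
    found = set()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            if adjacency_degree(tags, left, right) >= threshold:
                found.add((left, right))
    return found


# Toy pre-segmented corpus (illustrative only).
corpus = [
    ["support", "vector", "machine"],
    ["train", "support", "vector", "model"],
    ["vector", "space"],
]
candidates = extract_compounds(corpus)
```

On this toy corpus, "support" and "vector" survive the frequency filter and are always adjacent, so the pair is the only candidate. Storing positions as sets makes the adjacency check a constant-time lookup per occurrence.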