New word discovery is of great significance in natural language processing, and it is harder on microblog content than on general corpora. This paper proposes an iterative context entropy algorithm that incorporates word association information; a list of new word candidates is obtained from contextual relations and then filtered. To further improve precision, lexical features from natural language processing are introduced, and a filtering method is proposed that combines them with statistical features such as term frequency. Compared with existing methods, both precision and recall improve substantially, and the F-measure rises to 89.6%.
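To make the notion of context entropy concrete, the following is a minimal sketch of the standard left/right context (branching) entropy of a candidate string, computed at the character level. It is not the iterative algorithm with word association information proposed in the paper, whose details are not given here; the names `context_entropy`, `corpus`, and `candidate` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def context_entropy(corpus, candidate):
    """Return the (left, right) context entropy of `candidate` in `corpus`.

    Assumption: plain character-level branching entropy, used here only
    to illustrate the idea; this is not the paper's iterative variant.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[corpus[start - 1]] += 1   # character just before the match
        if end < len(corpus):
            right[corpus[end]] += 1        # character just after the match
        start = corpus.find(candidate, start + 1)

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        # H = sum over neighbours of p * log2(1 / p)
        return sum((c / total) * math.log2(total / c) for c in counts.values())

    return entropy(left), entropy(right)
```

A candidate whose neighbours vary widely on both sides (high left and right entropy) behaves like a free-standing word, whereas a low value on one side suggests the string is a fragment of a longer unit. For example, in the toy string `"abxcabxdabxe"` the candidate `"ab"` is always followed by `"x"`, so its right entropy is 0 and it would be rejected in favour of `"abx"`.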