现有的未知实体词识别方法主要针对人名、地名、机构名等具有特定结构的实体词进行识别,而随着电子商务和社交网络的快速发展,出现了大量结构不确定的专有领域未知实体词。针对该问题,提出两种基于上下文相关的未知词识别算法,通过计算词(字)和词(字)之间的上下文相关性,得到其潜在组合的支持度,并通过过滤模块过滤掉错误的组合,实现具有非确定型结构的未知实体词识别。实验表明,该算法具有较高的准确率,并且可以通过调整参数适应不同的应用场景。
Existing unknown words recognition methods mainly focus on unknown words with some specific structure, such as names, places and organizations. However, with the booming of e-commerce and social networking, more and more unknown entity words with uncertain structures appear in specific areas. In order to handle this problem, this paper presents two algorithms of unknown words recognition based on context-sensitive method. We first calculate correlations between any two words in sequence to get support of any potential combination, then filter out wrong combinations by filtering module, and achieve the recognition aiming at the non-deterministic structure of unknown words. Experiment results indicate that two algorithms can achieve a high accuracy. Besides, they can adapt to different application scenarios by adjusting the parameters.