With the rise of microblogs and other social platforms, recognizing product named entities in microblog data has become an active research topic in natural language processing, and it underpins public opinion monitoring and business intelligence. Traditional named entity recognition techniques do not account for the colloquial, nonstandard language of Chinese microblogs, and they overlook the important role of deep semantic information in named entity recognition. To address the particular characteristics of Chinese microblogs, this paper proposes a word-vector feature selection method that incorporates global contextual information. Deep semantic information is obtained in two ways, with a topic model and by clustering neural-network word vectors, and is combined with cascaded conditional random fields to recognize named entities in Chinese microblogs. Experimental results show that the method based on word-vector clustering achieves good results for product named entity recognition in Chinese microblogs.
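The idea of using word-vector clusters as features can be illustrated with a minimal sketch. It is not the paper's implementation: the two-dimensional vectors in `TOY_VECTORS`, the function names `kmeans` and `token_features`, the feature keys, and the choice of plain k-means with a deterministic initialization are all assumptions made for illustration. Real word vectors would come from a trained neural-network model, and the resulting feature dictionaries would be passed to a conditional random field, which is not shown here.

```python
import math

# Hypothetical 2-D "word vectors"; real embeddings would have far more
# dimensions and be learned from a large microblog corpus.
TOY_VECTORS = {
    "iphone": (0.9, 0.8),
    "galaxy": (1.0, 0.9),
    "mate": (0.8, 1.0),
    "love": (-0.9, -1.0),
    "hate": (-1.0, -0.8),
    "buy": (-0.8, -0.9),
}


def kmeans(vectors, k, iters=20):
    """Plain k-means over {word: vector}; returns {word: cluster_id}.

    Assumption: centroids start at the first k words in sorted order,
    which keeps the sketch deterministic but is a weak initialization.
    """
    words = sorted(vectors)
    centroids = [list(vectors[w]) for w in words[:k]]
    assign = {}
    for _ in range(iters):
        # Assignment step: attach each word to its nearest centroid.
        for w in words:
            assign[w] = min(
                range(k), key=lambda c: math.dist(vectors[w], centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_label(token, clusters):
    """Map a token to a discrete cluster label, or UNK if it has no vector."""
    return "C%d" % clusters[token] if token in clusters else "UNK"


def token_features(tokens, i, clusters):
    """Build a feature dict for token i from its own and its neighbors' clusters."""
    prev_label = cluster_label(tokens[i - 1], clusters) if i > 0 else "BOS"
    next_label = (
        cluster_label(tokens[i + 1], clusters) if i < len(tokens) - 1 else "EOS"
    )
    return {
        "word": tokens[i],
        "cluster": cluster_label(tokens[i], clusters),
        "prev_cluster": prev_label,
        "next_cluster": next_label,
    }


clusters = kmeans(TOY_VECTORS, k=2)
sentence = ["i", "love", "iphone"]
for i in range(len(sentence)):
    print(token_features(sentence, i, clusters))
```

Because the cluster label is discrete, words that are rare or misspelled in microblog text but lie close to known product names in vector space can share a feature with them, which is one plausible reason such features help on colloquial, nonstandard input.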