In recent years, the rapid growth of microblogs has provided a new carrier for named entity recognition (NER), while the characteristics of microblog text also pose challenges for NER research. To address these characteristics, this paper proposes a clustering algorithm based on pinyin similarity distance and text similarity distance to normalize microblog text, eliminating the interference caused by non-standard language use. The paper also proposes feature extraction at three levels of granularity (document, sentence, and word), trains a conditional random field (CRF) model on these features to recognize named entities, and corrects the predicted entity types using the entity relation classes obtained by clustering similar microblog texts. Because large amounts of microblog training data are not available, the model is trained within a semi-supervised learning framework. Experiments on Sina Weibo data show that the method effectively improves NER performance on microblogs.
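To make the normalization idea concrete, the following is a minimal sketch of a pinyin-based distance, not the paper's actual definition. It assumes that pinyin similarity can be approximated by a normalized Levenshtein distance over toneless pinyin strings. The lookup table `TOY_PINYIN`, the threshold of 0.2, and the function names are all hypothetical and chosen only for illustration; a real system would need a full pinyin converter and a proper standard-word lexicon.

```python
# Hypothetical toy table: toneless pinyin for a handful of characters only.
TOY_PINYIN = {
    "神": "shen", "马": "ma", "什": "shen", "么": "me",
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "童": "tong", "鞋": "xie", "同": "tong", "学": "xue",
}


def to_pinyin(word):
    """Concatenate the toy pinyin of each character (unknown characters are kept as-is)."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in word)


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def pinyin_distance(w1, w2):
    """Assumed measure: edit distance over pinyin, scaled to the range [0, 1]."""
    p1, p2 = to_pinyin(w1), to_pinyin(w2)
    longest = max(len(p1), len(p2))
    return edit_distance(p1, p2) / longest if longest else 0.0


def normalize_token(token, lexicon, threshold=0.2):
    """Map a token to the closest standard word if it is within the threshold."""
    best = min(lexicon, key=lambda w: pinyin_distance(token, w))
    return best if pinyin_distance(token, best) <= threshold else token


# Illustrative standard-word lexicon (an assumption, not from the paper).
lexicon = ["悲剧", "什么", "同学"]
normalized = [normalize_token(t, lexicon) for t in ["杯具", "神马", "童鞋"]]
```

Under these assumptions, homophone or near-homophone slang such as 杯具, 神马, and 童鞋 is mapped to the standard forms 悲剧, 什么, and 同学, while a token whose pinyin is far from every lexicon entry is left unchanged. The text-similarity clustering, CRF features, and semi-supervised training described above are not covered by this sketch.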