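To address the characteristics of microblog data, a noise-reduction algorithm and a conditional random field (CRF) model are used to tag microblog data with parts of speech (POS), and a Bayesian method is applied as a second pass to correct the POS of homophonic words, which make up a large share of the data. First, raw microblog data are obtained through the Sina platform API and a web crawler, and noise is then removed using hand-crafted rules based on the characteristics of the noise. Because CRFs have an advantage in feature extraction for Chinese POS tagging, a CRF model is used to tag the denoised microblog corpus. On this basis, exploiting the large share of homophonic words in the microblog corpus, microblog words are converted to pinyin and candidate original words for each homophonic word are computed with a Bayesian method. A mapping between homophonic words and original words is then established from the context of each word, and, since the POS of the original word is known, the POS of the homophonic word is corrected. Experimental results show that the method tags microblog out-of-vocabulary words well, reaching a POS tagging accuracy of 95.23%.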
The purpose of this work is to solve the problem of microblog part-of-speech (POS) tagging. POS tagging of Chinese new words is a difficult, important, and widely studied sequence modeling problem. This paper describes a hybrid model that combines a rule-based model with linear-chain conditional random fields (CRFs) and a Bayes algorithm for POS tagging of microblog unknown words. First, microblog data are obtained using the Sina API and a web spider. Based on the features of microblog text, a rule-based method is presented to reduce the impact of noisy data on POS tagging. Then, since CRFs have an advantage in feature extraction for POS tagging, they are used to obtain initial POS tags for microblog new words. We also present a probabilistic POS tagging method, which further improves performance. Homophonic words account for a large proportion of microblog new words. Because the pronunciation of a homophonic word is similar or identical to that of its original word, the Chinese Phonetic Alphabet (pinyin) is used to build a bridge between them, and this bridge is built using the Naive Bayes algorithm. Evaluation on a microblog test set shows that the system tags microblog new words well; the best precision it achieves is 95.23%.
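The homophone correction step can be illustrated with a minimal sketch. It is not the system evaluated here: the lexicon, the pinyin table, the training contexts, and all function names are hypothetical toy values, candidates are restricted to words with identical pinyin (similar pronunciations are ignored), and add-one smoothing is an assumed choice. The sketch selects the most probable original word with a Naive Bayes score over context words and transfers its known POS tag to the homophonic word.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon: known word -> (pinyin, POS tag).
LEXICON = {
    "悲剧": ("bei ju", "n"),
    "被拒": ("bei ju", "v"),
    "压力": ("ya li", "n"),
}

# Hypothetical pinyin of out-of-vocabulary homophonic words.
OOV_PINYIN = {"杯具": "bei ju", "鸭梨": "ya li"}

# Hypothetical training data: (original word, context words seen around it).
TRAINING = [
    ("悲剧", ["今天", "真是", "个"]),
    ("悲剧", ["真是", "一场"]),
    ("被拒", ["申请", "又", "了"]),
    ("压力", ["工作", "很", "大"]),
]


def train(samples):
    """Count word priors and per-word context frequencies."""
    prior = Counter()
    ctx_counts = defaultdict(Counter)
    vocab = set()
    for word, context in samples:
        prior[word] += 1
        ctx_counts[word].update(context)
        vocab.update(context)
    return prior, ctx_counts, vocab


def candidates(oov_word):
    """Known words whose pinyin is identical to that of the OOV word."""
    target = OOV_PINYIN.get(oov_word)
    return [w for w, (py, _) in LEXICON.items() if py == target]


def log_score(word, context, prior, ctx_counts, vocab):
    """log P(word) + sum of log P(context word | word), add-one smoothed."""
    total = sum(prior.values())
    score = math.log((prior[word] + 1) / (total + len(LEXICON)))
    denom = sum(ctx_counts[word].values()) + len(vocab)
    for c in context:
        score += math.log((ctx_counts[word][c] + 1) / denom)
    return score


def correct_pos(oov_word, context, model):
    """Return (original word, POS tag), or None if no candidate exists."""
    cands = candidates(oov_word)
    if not cands:
        return None
    best = max(cands, key=lambda w: log_score(w, context, *model))
    return best, LEXICON[best][1]


model = train(TRAINING)
result = correct_pos("杯具", ["申请", "又", "了"], model)
```

In this toy setting the same homophonic word can map to different original words depending on its context, which is why the context is part of the score rather than the pinyin match alone. A practical system would replace the hand-written pinyin table with a full pronunciation lexicon and estimate the probabilities from a large tagged corpus.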