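Conditional random fields (CRF) are applied to Chinese named entity recognition for the first time, and several feature templates are defined according to the characteristics of Chinese. To address the long-distance constraint problem, word trigger pairs are integrated into the CRF model. A word selection method based on word variance is proposed, and average mutual information (AMI) and the χ^2 statistic are used to compute the correlation between words. Tests on half a year of People's Daily show that, with the same feature set, the CRF model outperforms other probabilistic models, and that the CRF model with long-distance trigger pairs raises the system's F-measure by about 1.38%.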
In this paper, conditional random fields (CRF), a new probabilistic model well suited to labeling sequence data, are introduced to the task of Chinese named entity recognition (CNER) for the first time. Unlike generative models, CRF does not spend modeling effort on the observations and can exploit rich, overlapping features; moreover, it avoids the label bias problem that affects other discriminative models. To perform CNER, special features are selected to capture the informative traits of the Chinese language. In addition, word triggers are integrated into the CRF to solve the long-distance constraint problem; compared with mining parallel information in a large window or the whole sentence, this approach needs a smaller parameter space and less memory. Word triggers are selected in two steps: preparing candidate words and estimating the degree of correlation between two words. Two methods, average mutual information (AMI) and the χ^2 statistic, are used to estimate the degree of correlation. Experimental results on half a year of People's Daily show that the CRF combined with word triggers extracted by the χ^2 method achieves state-of-the-art performance.
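As an illustration of the correlation step, the two scores can be sketched for a single candidate word pair from a 2x2 co-occurrence table. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names are hypothetical, and the counts are assumed to be numbers of text units (for example sentences) in which both words occur (`n11`), only the first (`n10`), only the second (`n01`), or neither (`n00`). How the paper defines co-occurrence is not stated in this abstract.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    Assumed layout (hypothetical): n11 = both words present,
    n10 = only word a, n01 = only word b, n00 = neither.
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    # A zero marginal means one word never (or always) occurs: no evidence.
    return numerator / denominator if denominator else 0.0


def average_mutual_information(n11, n10, n01, n00):
    """Average mutual information (in bits) over the four joint events."""
    n = n11 + n10 + n01 + n00
    # Each cell is paired with its row marginal (word a) and column marginal (word b).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    ami = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            ami += (joint / n) * math.log2(joint * n / (row_total * col_total))
    return ami
```

Under these assumptions, candidate pairs would be ranked by either score and the highest-scoring pairs kept as triggers; the cutoff is a tuning choice that the abstract does not specify.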