Statistical methods for Chinese word segmentation adapt poorly to new domains because their training corpora are limited to specific domains. Domain dictionaries are much easier to obtain than manually annotated segmentation corpora, and they provide rich domain information. This paper achieves domain adaptation by integrating dictionary information, in the form of features, into a statistical segmentation model (a CRF model in this paper). Experiments show that this approach significantly improves the domain adaptability of statistical Chinese word segmentation: the F-measure increases by 2% when the test domain is the same as the training domain, and by 6% when the test domain differs from the training domain.
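As an illustration of what "dictionary information as features" can look like for a character-based CRF segmenter, the following is a minimal sketch. The abstract does not specify the feature templates, so the scheme here (the length of the longest dictionary word that begins at, ends at, or covers each character) is an assumption, and the names `dictionary_features`, `lexicon`, and `max_len` are hypothetical.

```python
def dictionary_features(sentence, lexicon, max_len=4):
    """Return one feature dict per character of `sentence`.

    Assumed scheme (not taken from the paper): for each character, record
    the length of the longest lexicon word that begins at it ("begin"),
    ends at it ("end"), or strictly contains it ("inside"). A value of 0
    means no multi-character lexicon word matches in that role.
    """
    n = len(sentence)
    feats = [{"char": ch, "begin": 0, "end": 0, "inside": 0} for ch in sentence]
    for start in range(n):
        # Only multi-character candidates are checked (length >= 2).
        for length in range(2, max_len + 1):
            stop = start + length
            if stop > n:
                break
            if sentence[start:stop] in lexicon:
                feats[start]["begin"] = max(feats[start]["begin"], length)
                feats[stop - 1]["end"] = max(feats[stop - 1]["end"], length)
                for mid in range(start + 1, stop - 1):
                    feats[mid]["inside"] = max(feats[mid]["inside"], length)
    return feats


if __name__ == "__main__":
    # Hypothetical domain lexicon and sentence, for illustration only.
    domain_lexicon = {"机器学习", "学习"}
    for feature in dictionary_features("机器学习方法", domain_lexicon):
        print(feature)
```

Feature dicts of this kind would be combined with the usual character n-gram features and passed to a CRF toolkit that accepts per-token features. Because the lexicon is consulted only at feature-extraction time, swapping in a dictionary for a different domain would not require re-annotating a training corpus, which is the property the abstract relies on.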