Inconsistent word segmentation in large-scale corpora degrades the segmentation quality of those corpora. Based on a statistical analysis of a segmented corpus of 1.5 million Chinese characters, we define the main structural types of segmentation inconsistency found in the corpus. A rule-based method is then used to check and proofread inconsistently segmented character strings. In a closed test on the 1.5-million-character corpus, the accuracy was 86.94%.
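To make the notion of segmentation inconsistency concrete, the following is a minimal sketch, not the authors' method. It assumes one common reading of the term: a character string counts as a candidate inconsistency when it appears as a single word in one place and as a sequence of shorter words in another. The function name `find_inconsistencies` and the `max_parts` parameter are hypothetical, and the abstract does not state the actual rules used for checking, so none are reproduced here.

```python
from collections import defaultdict


def find_inconsistencies(sentences, max_parts=3):
    """Return candidate strings that are segmented in more than one way.

    sentences: list of sentences, each a list of already-segmented words.
    max_parts: longest run of adjacent words to join (an assumed limit).
    """
    # Every word that occurs as a single token somewhere in the corpus.
    vocab = set()
    for tokens in sentences:
        vocab.update(tokens)

    # Map each candidate string to the split segmentations observed for it.
    variants = defaultdict(set)
    for tokens in sentences:
        for i in range(len(tokens)):
            for n in range(2, max_parts + 1):
                if i + n > len(tokens):
                    break
                parts = tuple(tokens[i:i + n])
                joined = "".join(parts)
                # The same string also occurs unsplit: flag it.
                if joined in vocab:
                    variants[joined].add(parts)

    # Include the unsplit form alongside the split forms.
    return {s: {(s,)} | splits for s, splits in variants.items()}


if __name__ == "__main__":
    corpus = [
        ["我", "喜欢", "自然语言"],
        ["自然", "语言", "处理"],
    ]
    for string, forms in find_inconsistencies(corpus).items():
        print(string, sorted(forms))
```

A sketch like this only surfaces candidates. Deciding which segmentation is correct in each context would be the job of the checking rules, which are outside what the abstract specifies.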