Word segmentation and part-of-speech (POS) tagging are basic steps in Chinese text processing, and their accuracy largely determines the quality of everything that follows. Traditionally, dictionary-based mechanical segmentation is used. In text proofreading, however, errors in the input text degrade the output of this method, so that subsequent error detection and correction are built on an incorrect foundation. This paper explores a segmentation and POS tagging algorithm suited to text proofreading and proposes an approach based on full segmentation and integrated tagging. Experiments show that, in addition to achieving high precision and recall, the algorithm effectively suppresses the influence of text errors on segmentation and POS tagging.
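To make the full-segmentation idea (全切分) concrete, the following is a minimal sketch, not the paper's actual algorithm, which the abstract does not spell out. It enumerates every way to split a string into words from a lexicon. The function name `full_segmentation`, the toy lexicon, and the rule that any single character may stand alone (so that text containing errors can still be segmented) are all assumptions made for illustration. Choosing among the candidates, for example by scoring them jointly with POS tags, is left out.

```python
from functools import lru_cache


def full_segmentation(text, lexicon):
    """Enumerate all segmentations of `text` into lexicon words.

    Assumption for this sketch: a single character is always accepted as a
    token, so characters not covered by the lexicon do not block the search.
    """
    max_len = max((len(word) for word in lexicon), default=1)

    @lru_cache(maxsize=None)
    def split_from(start):
        # One way to segment the empty remainder: the empty sequence.
        if start == len(text):
            return ((),)
        results = []
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = text[start:end]
            if end - start == 1 or piece in lexicon:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(candidate) for candidate in split_from(0)]


if __name__ == "__main__":
    # Hypothetical toy lexicon, chosen only to show an ambiguous input.
    toy_lexicon = {"研究", "研究生", "生命"}
    for candidate in full_segmentation("研究生命", toy_lexicon):
        print(" / ".join(candidate))
```

The number of candidates can grow exponentially with sentence length, so a practical system would likely store them in a word lattice and search it rather than list them all. The exhaustive listing here is only meant to show what "all possible segmentations" means.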