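In Mongolian morphological analysis, our earlier directed-graph model achieved high performance. It uses a graph structure to model the probabilistic relationships between the stems and affixes in a sentence, and so draws on context to select the best segmentation-and-tagging candidate for each word. For the directed-graph model to work well, the candidate set for each word must include, as far as possible, every legal segmentation-and-tagging candidate. This paper proposes a stem-affix segmentation strategy based on discriminative classification. Compared with the earlier enumeration scheme based on stem and affix tables, it generalizes better to words containing out-of-vocabulary stems. With a manually annotated three-level corpus of 200,000 words as training data, the directed-graph morphological analyzer using discriminative stem-affix segmentation improves word-level segmentation-and-tagging accuracy by 7 percentage points on words containing out-of-vocabulary stems.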
In Mongolian lexical analysis, the directed-graph-based model achieves high performance. This model uses a directed-graph architecture to describe the probabilistic relationships between stems and affixes, and thereby determines the best segmented and tagged candidate for each word according to its context. It is therefore essential for a directed-graph-based analyzer to enumerate all legal segmented and tagged candidates for each word. This paper proposes a novel stem-affix segmentation model for Mongolian lexical analysis, based on discriminative classification. Compared with the enumeration strategy based on stem and affix sets, this method generalizes better to words with unknown stems. Using a three-level annotated corpus of about 200,000 words as training data, the directed-graph-based lexical analyzer with the discriminative stem-affix segmentation module achieves a further 7% improvement in F1 measure on words with unknown stems.
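The contrast between the two candidate-generation strategies can be illustrated with a minimal sketch. It is not the paper's implementation: the stem and affix tables are toy placeholder strings, the function names are hypothetical, and the hand-set feature weights merely stand in for the weights a trained discriminative classifier would learn. The sketch only shows why table-based enumeration yields no candidate for a word whose stem is missing from the stem table, while a scorer that relies on affix evidence can still propose a split.

```python
# Toy tables (assumption): placeholder romanized strings, not a real lexicon.
STEMS = {"nom", "ger"}
AFFIXES = {"d", "uud", "iin"}

# Hand-set weights (assumption) standing in for learned classifier weights.
WEIGHTS = {
    "affix=uud": 2.0,
    "affix=iin": 2.0,
    "affix=d": 0.5,
    "affix=": 0.0,      # whole word treated as a bare stem
    "known_stem": 1.5,  # bonus only; an unknown stem is not ruled out
}
UNSEEN_FEATURE_WEIGHT = -1.0  # penalty for a feature never seen in training


def enumerate_by_lexicon(word, stems, affixes):
    """Table-based enumeration: keep a split only if both parts are listed."""
    candidates = []
    for i in range(1, len(word) + 1):
        stem, affix = word[:i], word[i:]
        if stem in stems and (affix == "" or affix in affixes):
            candidates.append((stem, affix))
    return candidates


def score_split(stem, affix, stems, weights):
    """Linear score of one split from features that do not require a known stem."""
    features = ["affix=" + affix]
    if stem in stems:
        features.append("known_stem")
    return sum(weights.get(f, UNSEEN_FEATURE_WEIGHT) for f in features)


def propose_splits(word, stems, weights, threshold=0.0):
    """Score every split point and keep those at or above the threshold,
    best first. The kept splits would then be passed to the graph model."""
    scored = []
    for i in range(1, len(word) + 1):
        stem, affix = word[:i], word[i:]
        s = score_split(stem, affix, stems, weights)
        if s >= threshold:
            scored.append((s, (stem, affix)))
    scored.sort(key=lambda pair: -pair[0])
    return [split for _, split in scored]


if __name__ == "__main__":
    for w in ("nomuud", "sumuud"):  # "sum" is absent from STEMS
        print(w, enumerate_by_lexicon(w, STEMS, AFFIXES),
              propose_splits(w, STEMS, WEIGHTS))
```

In this sketch the scorer deliberately over-generates: it returns several plausible splits rather than one, on the assumption that the directed-graph model then chooses among them using sentence context.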