Tibetan automatic word segmentation (TAWS) is a fundamental problem in Tibetan information processing, and abbreviated word recognition is one of its key and most difficult subproblems. The abbreviated word recognition methods published to date are all rule-based and depend on a lexicon. In this paper, we propose a method for abbreviated word recognition based on conditional random fields (CRFs) and, building on it, implement a CRF-based TAWS system. Experimental results show that the proposed recognition method is fast and effective, and that it can be easily combined with the segmentation module, significantly improving Tibetan word segmentation performance.