Tibetan word segmentation (TWS) is a fundamental problem in Tibetan information processing. Most current sequence-labeling approaches to TWS rely on features such as syllable position and syllable category. This paper extracts unsupervised features from an unlabeled corpus, namely boundary entropy, accessor variety, and unsupervised gap tagging, and integrates them into a sequence-labeling segmentation system. Experimental results show that, compared with the baseline TWS system, the segmentation F-score improves by 0.97%, and the recognition of out-of-vocabulary words also improves considerably. These results indicate that the unsupervised features extracted from unlabeled data are effective, and that combining them with a supervised segmentation model significantly improves on the baseline system.
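As a rough illustration of two of the unsupervised features named above, the following sketch computes accessor variety and right boundary entropy for syllable n-grams in an unlabeled corpus. It is not the paper's implementation: the function names, the `max_n` limit, the placeholder syllables, and the choice to count the sentence boundary as a single neighbour type are all assumptions made for this example. Accessor variety is taken here as the smaller of the numbers of distinct left and right neighbours, and boundary entropy as the Shannon entropy of the neighbour distribution.

```python
import math
from collections import Counter, defaultdict


def adjacency_stats(sentences, max_n=3):
    """Count left and right neighbours of every syllable n-gram (n <= max_n).

    `sentences` is an iterable of syllable lists from an unlabeled corpus.
    Sentence boundaries are represented by the markers <s> and </s>
    (a simplification assumed for this sketch).
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for syllables in sentences:
        padded = ["<s>"] + list(syllables) + ["</s>"]
        for n in range(1, max_n + 1):
            # i is the start of the n-gram; it never covers the markers.
            for i in range(1, len(padded) - n):
                gram = tuple(padded[i:i + n])
                left[gram][padded[i - 1]] += 1
                right[gram][padded[i + n]] += 1
    return left, right


def accessor_variety(gram, left, right):
    """Smaller of the distinct left and distinct right neighbour counts."""
    return min(len(left.get(gram, ())), len(right.get(gram, ())))


def boundary_entropy(neighbours):
    """Shannon entropy (in bits) of a neighbour frequency distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbours.values())


# Toy corpus with placeholder syllables (hypothetical, not real Tibetan data).
toy = [["a", "b", "c"], ["a", "b", "d"], ["e", "b", "c"]]
left, right = adjacency_stats(toy)

av_b = accessor_variety(("b",), left, right)
av_ab = accessor_variety(("a", "b"), left, right)
h_right_ab = boundary_entropy(right[("a", "b")])
```

In a sequence-labeling segmenter, values like these would typically be discretized into a few bins and attached to each syllable as extra feature columns alongside the supervised features; how the paper bins or combines them is not stated in the abstract.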