This paper proposes an automatic segmentation method for Uyghur text based on measuring the association strength between adjacent words. Word-level bigram and contextual information are extracted automatically from a large-scale raw corpus, and, taking the combination rules of Uyghur words into account, the mutual information, the difference of t-test, and the dual adjacent entropy of adjacent words are fused linearly into a combined statistic (dmd) that quantifies how strongly neighbouring words in a text are associated. Inter-word positions that dmd identifies as weakly associated are taken as segmentation points, so that the output consists of semantically and structurally complete word strings rather than merely space-delimited words. Experiments on a large-scale text corpus show that the proposed method achieves a segmentation accuracy of 88.21%.
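To make the gap-scoring idea concrete, the following is a minimal Python sketch, not the authors' implementation: it collects word bigram and neighbourhood counts from a raw corpus, scores every gap between adjacent words with a linear mix of pointwise mutual information, a t-score difference, and the entropy of the words flanking the bigram, and cuts where the score is weak. The exact formulations of the t-test difference and the dual adjacent entropy, the fusion weights `w`, and the cut-off `threshold` are illustrative assumptions rather than the paper's definitions.

```python
import math
from collections import Counter, defaultdict


def collect_stats(sentences):
    """Count unigrams, bigrams, and the left/right neighbours of each bigram."""
    uni, bi = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for words in sentences:
        uni.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bi[pair] += 1
            if i > 0:
                left[pair][words[i - 1]] += 1
            if i + 2 < len(words):
                right[pair][words[i + 2]] += 1
    return uni, bi, left, right


def entropy(counter):
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def pmi(x, y, uni, bi, n):
    """Pointwise mutual information of the bigram (x, y)."""
    pxy = bi[(x, y)] / n
    return math.log2(pxy / ((uni[x] / n) * (uni[y] / n))) if pxy > 0 else 0.0


def t_score(x, y, uni, bi, n):
    """Standard bigram t-score approximation."""
    cxy = bi[(x, y)]
    return (cxy - uni[x] * uni[y] / n) / math.sqrt(cxy) if cxy > 0 else 0.0


def dmd(words, i, uni, bi, left, right, n, w=(0.4, 0.3, 0.3)):
    """Association score for the gap between words[i] and words[i + 1].

    The weights w and the particular way the three terms are combined are
    illustrative assumptions, not the paper's formulation.
    """
    pair = (words[i], words[i + 1])
    mi = pmi(*pair, uni, bi, n)
    # t-test difference (assumed form): compare this gap's t-score with the
    # t-scores of the neighbouring gaps.
    t_here = t_score(*pair, uni, bi, n)
    t_prev = t_score(words[i - 1], words[i], uni, bi, n) if i > 0 else 0.0
    t_next = t_score(words[i + 1], words[i + 2], uni, bi, n) if i + 2 < len(words) else 0.0
    dt = t_here - 0.5 * (t_prev + t_next)
    # Adjacent-pair entropy (assumed sign): diverse words flanking the bigram
    # suggest it behaves as one unit, so it contributes positively here.
    h = entropy(left[pair]) + entropy(right[pair])
    return w[0] * mi + w[1] * dt + w[2] * h


def segment(words, uni, bi, left, right, n, threshold=0.0):
    """Cut at gaps whose dmd score falls below the (illustrative) threshold."""
    chunks, current = [], [words[0]]
    for i in range(len(words) - 1):
        if dmd(words, i, uni, bi, left, right, n) < threshold:
            chunks.append(current)
            current = []
        current.append(words[i + 1])
    chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Toy whitespace-tokenised corpus standing in for a large raw corpus.
    corpus = [
        "a b c d".split(),
        "a b e f".split(),
        "g a b c".split(),
    ]
    uni, bi, left, right = collect_stats(corpus)
    n = sum(uni.values())
    print(segment("a b c d".split(), uni, bi, left, right, n))
```

In this sketch the frequently co-occurring pair "a b" tends to stay inside one chunk while rarer gaps are cut; in practice the weights and threshold would be tuned on held-out data.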