Because the training corpus is very large, the ARPA-format statistical language model files produced by the SRILM toolkit are often too large, which slows lookup and consumes a great deal of storage. To address this problem, and drawing on the ideas of clustering and index-based lookup, this paper proposes a compression method for N-gram Chinese language model files that uses the K-means clustering algorithm to compress the transition probabilities and backoff probabilities in the model, and uses multi-level indexing to speed up lookup. Theoretical analysis and experiments show that the method achieves a very good compression ratio while limiting the effect of compression-induced distortion on word selection. It also reduces redundant lookups, improves lookup efficiency on the language model file, and noticeably improves the response speed of the input method.
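The probability-compression idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the probabilities are the base-10 log values stored in an ARPA file, clusters them with plain one-dimensional K-means (Lloyd's algorithm), and then stores a small integer code per entry instead of a float. The function name `kmeans_1d`, the quantile-based initialisation, the iteration count, and the sample values are all hypothetical choices made for illustration. The multi-level index is not covered here.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalars with Lloyd's algorithm; return (codebook, codes)."""
    uniq = sorted(set(values))
    n = len(uniq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")
    # Deterministic initialisation (an assumption of this sketch):
    # take the middle element of each of k equal slices of the sorted values.
    centroids = [uniq[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    def assign(vals):
        # Assignment step: index of the nearest centroid for each value.
        return [min(range(k), key=lambda j: abs(v - centroids[j])) for v in vals]

    for _ in range(iters):
        codes = assign(values)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        for j in range(k):
            members = [v for v, c in zip(values, codes) if c == j]
            if members:
                centroids[j] = sum(members) / len(members)
    # Final assignment so the codes match the returned centroids.
    return centroids, assign(values)


# Hypothetical log10 probabilities, as they might appear in an ARPA file.
log_probs = [-0.30, -0.31, -1.20, -1.22, -2.50, -2.49]
codebook, codes = kmeans_1d(log_probs, k=3)

# Decompression is a table lookup: each code indexes into the codebook.
restored = [codebook[c] for c in codes]
max_error = max(abs(a - b) for a, b in zip(log_probs, restored))
```

With a codebook of 256 entries, each probability could be stored as a one-byte code rather than a four-byte float, at the cost of a quantization error bounded by the spread of each cluster. The choice of `k` is what trades file size against the distortion that may affect word selection.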