

A Language Model Compression Method Based on Clustering and Indexing Techniques

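ISSN: 1673-629X
Journal: 《计算机技术与发展》 (Computer Technology and Development)
Date: 0
Classification: TP319.14 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] Beam Measurement and Control Department, Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai 201800; [2] Speech Topic Department, Shanda Innovations Institute, Shanghai 201203
Funding: National "973" Key Basic Research Development Program of China (2011CB808300)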

Keywords: language model, compression method, clustering algorithm, K-means clustering algorithm, multilevel index technology

Abstract (translated from the Chinese):

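Because the training corpus is very large, the ARPA-format statistical language model data files generated by SRILM training are too big, which slows down lookup and consumes a large amount of storage. To address this problem, and drawing on the ideas of clustering and index lookup, this paper proposes a compression method that compresses the transition probabilities and backoff probabilities in the language model with the K-means clustering algorithm and improves lookup speed through multilevel index technology. Theoretical analysis and experiments show that the method achieves a very good compression ratio while limiting the effect of compression-induced data distortion on word selection, improves the efficiency of lookups in the language model file, and noticeably improves the response speed of the input method.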
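
As a rough illustration of the clustering step the abstract describes, the sketch below quantizes a handful of log probabilities with one-dimensional K-means: each value is replaced by a small integer code that points into a shared codebook of centroids. This is not the authors' implementation. The function names (`kmeans_1d`, `quantize`), the quantile-based initialization, and the sample values are all assumptions made for this example.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's K-means on scalars; returns the centroids in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: one starting centroid per evenly spaced quantile.
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[nearest].append(v)
        # An empty bucket keeps its previous centroid.
        updated = [sum(b) / len(b) if b else centroids[i]
                   for i, b in enumerate(buckets)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def quantize(values, codebook):
    """Map each value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(v - codebook[i]))
            for v in values]


# Hypothetical log10 probabilities, as they might appear in an ARPA file.
log_probs = [-0.30, -0.31, -0.29, -1.50, -1.52, -1.48, -3.10, -3.05, -3.12]

codebook = kmeans_1d(log_probs, k=3)      # shared table of k floats
codes = quantize(log_probs, codebook)     # one small integer per n-gram
restored = [codebook[c] for c in codes]   # lossy reconstruction
max_error = max(abs(a - b) for a, b in zip(log_probs, restored))
```

With k centroids, each stored probability needs only about log2(k) bits for its code instead of a full floating-point value, at the cost of the reconstruction error measured by `max_error`. That error is the "data distortion" the abstract refers to.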

English abstract:

Because of the large scale of the training corpus, the ARPA-format language model data file produced by the SRILM toolkit usually takes up too much space and slows down lookup. To address this problem, and drawing on the ideas of unsupervised clustering analysis and multilevel indexing, this paper proposes a compression method for N-gram Chinese language model files that is based on the K-means clustering algorithm and uses multilevel index technology to increase search speed. Theoretical analysis and experiments show that the method achieves an outstanding compression ratio and effectively reduces redundant searches, showing good performance.
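
The abstract does not spell out the index layout, so the following is only one plausible reading of "multilevel index": a two-level lookup for bigrams in which the first level maps a first-word id to a slice of a flat array, and the second level is binary-searched for the second-word id. The class name `TwoLevelIndex`, the integer word ids, and the sample codebook are hypothetical and chosen for illustration.

```python
from bisect import bisect_left


class TwoLevelIndex:
    """Level 1: first-word id -> (start, end) slice of the flat arrays.
    Level 2: second-word ids, sorted within each slice, each with a code."""

    def __init__(self, bigrams):
        # bigrams: dict mapping (w1, w2) -> codebook index (assumed layout)
        self.offsets = {}
        self.second = []
        self.codes = []
        for (w1, w2), code in sorted(bigrams.items()):
            start, _ = self.offsets.get(w1, (len(self.second), 0))
            self.second.append(w2)
            self.codes.append(code)
            self.offsets[w1] = (start, len(self.second))

    def lookup(self, w1, w2):
        """Return the stored code for (w1, w2), or None if it is absent."""
        span = self.offsets.get(w1)
        if span is None:
            return None
        start, end = span
        # Binary search restricted to the slice owned by w1.
        pos = bisect_left(self.second, w2, start, end)
        if pos < end and self.second[pos] == w2:
            return self.codes[pos]
        return None


# Hypothetical codebook of quantized log probabilities and a tiny bigram table.
codebook = [-3.09, -1.50, -0.30]
bigrams = {(1, 5): 2, (1, 9): 0, (2, 3): 1, (2, 7): 2, (4, 1): 0}

index = TwoLevelIndex(bigrams)
hit = index.lookup(2, 7)          # code for an existing bigram
miss = index.lookup(2, 5)         # bigram not present
log_prob = codebook[hit]          # decode the code through the codebook
```

Restricting the binary search to the slice for the first word means a query never scans entries belonging to other first words. A real decoder would fall back to a lower-order n-gram using the backoff weight when `lookup` returns `None`.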
