For speech recognition in a specific domain, building a high-performance language model from a limited corpus is key to improving system performance. This paper studies domain-specific language modeling and proposes using high-frequency new words to strengthen the domain characteristics of the model. Two schemes are examined and compared: in the first, the high-frequency new words are added directly to the existing dictionary and given higher weights during training, so that the model better captures domain-related features; in the second, a small domain-specific vocabulary is compiled from the high-frequency new words. Smoothing strategies suited to Chinese are also investigated experimentally. The results show that, for domain-specific tasks, the choice of smoothing algorithm has a considerable effect on model performance; with Witten-Bell interpolated smoothing, which suits Chinese, the recognition rate reaches 88.4%, a relative improvement of 18.18% over the general-purpose model.
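To make the two central ideas concrete, the following is a minimal sketch of an interpolated Witten-Bell bigram model in which selected domain words can be given extra weight while counting. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `boost` parameter, the way weights are folded into the counts, and the toy corpus are all hypothetical, and the abstract does not specify the n-gram order or the toolkit that was used. The interpolation weight for a history h is taken as c(h) / (c(h) + T(h)), where T(h) is the number of distinct word types observed after h.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences, boost_words=frozenset(), boost=1.0):
    """Collect unigram and bigram counts from tokenized sentences.

    Assumption for illustration: each occurrence of a word in
    `boost_words` is counted `boost` times, as one simple way to
    raise the weight of high-frequency domain words.
    """
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            weight = boost if cur in boost_words else 1.0
            unigrams[cur] += weight
            bigrams[prev][cur] += weight
    return unigrams, bigrams


def witten_bell_prob(cur, prev, unigrams, bigrams):
    """Interpolated Witten-Bell estimate of P(cur | prev).

    The lower-order model is the maximum-likelihood unigram
    distribution; a full system would smooth that level as well.
    """
    total = sum(unigrams.values())
    p_unigram = unigrams[cur] / total
    followers = bigrams.get(prev)
    if not followers:
        # Unseen history: fall back entirely to the unigram estimate.
        return p_unigram
    history_count = sum(followers.values())
    distinct_followers = len(followers)
    lam = history_count / (history_count + distinct_followers)
    p_bigram = followers[cur] / history_count
    return lam * p_bigram + (1.0 - lam) * p_unigram


# Hypothetical toy corpus of pre-segmented domain commands.
toy_corpus = [["打开", "空调"], ["打开", "车窗"], ["关闭", "空调"]]
uni, bi = train_bigram_counts(toy_corpus)
print(witten_bell_prob("空调", "打开", uni, bi))
```

Because the weight is applied to the raw counts, boosting a word raises both its bigram and its unigram estimates, and the distribution over the observed vocabulary still sums to one for any seen history.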