End-to-end large-vocabulary continuous speech recognition based on connectionist temporal classification (CTC) has become an active research topic. In this paper, it is applied to Tibetan automatic speech recognition (ASR), where it outperforms the mainstream bidirectional long short-term memory (BLSTM) approach. However, end-to-end speech recognition does not require linguistic knowledge such as a pronunciation lexicon, so its recognition performance cannot be guaranteed. To address this problem, a strategy is proposed that incorporates existing linguistic knowledge into CTC-based end-to-end acoustic modeling, with tied triphones as the modeling units. This addresses the sparsity of the modeling units and substantially improves the discrimination and robustness of the acoustic model. Experiments on a Tibetan test set show that the proposed method improves the word accuracy of the CTC-based acoustic model and verify the effectiveness of combining linguistic knowledge with end-to-end acoustic modeling.
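As a rough illustration of why tying triphones reduces the sparsity of the modeling units, the following minimal sketch expands a phone sequence into triphones and maps each one to a shared (tied) unit. The phone symbols, the `TYING` table, and the fallback to the center phone are all hypothetical placeholders chosen for illustration; they are not taken from the paper, whose tying procedure is not described here. In practice such a table would typically be derived by data-driven clustering rather than written by hand.

```python
from typing import Dict, List, Tuple

# A triphone is (left context, center phone, right context).
Triphone = Tuple[str, str, str]


def to_triphones(phones: List[str], boundary: str = "sil") -> List[Triphone]:
    """Expand a phone sequence into triphones, padding both ends with a boundary symbol."""
    padded = [boundary] + phones + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]


def tie(triphones: List[Triphone], tying_table: Dict[Triphone, str]) -> List[str]:
    """Map each triphone to its tied unit.

    Assumption: triphones missing from the table fall back to the center phone.
    """
    return [tying_table.get(t, t[1]) for t in triphones]


# Hypothetical tying table: two different triphones of "a" share one unit,
# so fewer distinct units need to be trained than there are triphones.
TYING: Dict[Triphone, str] = {
    ("sil", "k", "a"): "k_1",
    ("k", "a", "t"): "a_1",
    ("k", "a", "p"): "a_1",
    ("a", "t", "sil"): "t_1",
}

units_kat = tie(to_triphones(["k", "a", "t"]), TYING)
units_kap = tie(to_triphones(["k", "a", "p"]), TYING)
print(units_kat)  # ['k_1', 'a_1', 't_1']
print(units_kap)  # ['k_1', 'a_1', 'p']
```

The resulting tied-unit sequences would then serve as the label sequences for CTC training in place of raw phone or character labels. The triphone `("a", "p", "sil")` is absent from the table, so it falls back to the center phone `p` under the assumption stated above.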