In conventional audio-visual bimodal speech recognition systems, visual features obtained through image processing tend to be large in data volume and to lose important information. To address these problems, we propose using image sonification to extract features from the video images. A BP neural network optimised by a genetic algorithm serves as the fusion model, fusing the video and audio features at the feature level. Experimental results show that the visual features obtained through image sonification carry a certain amount of speech information and yield relatively stable recognition performance in noisy environments, and that the neural-network fusion model improves the robustness of the system.
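As an illustration of the fusion step named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that feature-level fusion means concatenating the audio and visual feature vectors, that the genetic algorithm is used only to pick good initial weights, and that back-propagation then fine-tunes them. The function names (`fuse`, `ga_init`, `train_bp`), the single hidden layer, the single sigmoid output, and all hyperparameters are hypothetical choices made for the example.

```python
import math
import random


def fuse(audio_feat, visual_feat):
    """Feature-level fusion (assumed here to be plain concatenation)."""
    return list(audio_feat) + list(visual_feat)


def n_weights(n_in, n_hid):
    # hidden weights + hidden biases + output weights + output bias
    return n_in * n_hid + n_hid + n_hid + 1


def _sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x, n_hid):
    """One hidden sigmoid layer and one sigmoid output; w is a flat list."""
    n_in = len(x)
    hidden = []
    for j in range(n_hid):
        z = w[n_in * n_hid + j]  # hidden bias
        for i in range(n_in):
            z += w[j * n_in + i] * x[i]
        hidden.append(_sigmoid(z))
    off = n_in * n_hid + n_hid
    z_out = w[off + n_hid] + sum(w[off + j] * hidden[j] for j in range(n_hid))
    return hidden, _sigmoid(z_out)


def mse(w, data, n_hid):
    return sum((forward(w, x, n_hid)[1] - y) ** 2 for x, y in data) / len(data)


def ga_init(data, n_hid, pop_size=20, generations=30, seed=0):
    """Genetic search for initial weights: elitism, one-point crossover, mutation."""
    rng = random.Random(seed)
    dim = n_weights(len(data[0][0]), n_hid)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: mse(w, data, n_hid))
        elite = pop[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, dim)
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0, 0.1) if rng.random() < 0.1 else g
                     for g in child]
            children.append(child)
        pop = elite + children
    return min(pop, key=lambda w: mse(w, data, n_hid))


def train_bp(w, data, n_hid, lr=0.5, epochs=200):
    """Per-sample gradient descent (back-propagation) on squared error."""
    w = list(w)
    n_in = len(data[0][0])
    off = n_in * n_hid + n_hid
    for _ in range(epochs):
        for x, y in data:
            hidden, out = forward(w, x, n_hid)
            d_out = (out - y) * out * (1 - out)
            for j in range(n_hid):
                # use the output weight before it is updated
                d_hid = d_out * w[off + j] * hidden[j] * (1 - hidden[j])
                w[off + j] -= lr * d_out * hidden[j]
                for i in range(n_in):
                    w[j * n_in + i] -= lr * d_hid * x[i]
                w[n_in * n_hid + j] -= lr * d_hid
            w[off + n_hid] -= lr * d_out
    return w
```

A typical use would build each training sample as `(fuse(audio, visual), label)`, call `ga_init` to obtain starting weights, and pass them to `train_bp`. Because the surviving half of each generation is carried over unchanged, the best training error found by the genetic search never increases from one generation to the next.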