Traditional dimensional speech emotion recognition systems rely on utterance-level global statistical features, which discards fine-grained prosodic detail and the temporal evolution of the features. To address this, this paper proposes a multi-granularity feature extraction method based on different time units, producing short-term frame-level, mid-term segment-level, and long-term window-level features, together with a Cognition-Inspired Recurrent Neural Network (CIRNN) that fuses these multi-granularity features. CIRNN simulates the step-by-step process by which the human brain handles speech signals: by fusing the multi-granularity features so that features at every time unit participate in training, it highlights the temporal dynamics of emotion while preserving the contribution of global characteristics, achieving multi-level information fusion. The network also mimics the brain's comparison against past experience by introducing a memory layer that retains the emotional features of the preceding context, strengthening the influence of contextual information on recognition. The method is evaluated on the VAM corpus for continuous dimensional emotion estimation along the Activation, Dominance, and Valence dimensions; the average correlation coefficient reaches 0.66, a clear improvement over the commonly used ANN and SVR approaches.
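To make the pipeline concrete, below is a minimal sketch of the two ideas the abstract describes: feature extraction at three time units, and a recurrent network with an explicit memory layer that carries preceding emotional context. Everything here is an illustrative assumption rather than the authors' implementation: the function and class names, the layer sizes, the raw-energy placeholder features (a real front end would extract prosodic and spectral descriptors such as pitch, energy, and MFCCs), and the Elman-style context state standing in for the paper's memory layer.

```python
import numpy as np

def multi_granularity_features(signal, sr=16000):
    """Toy features at three time units: short-term frames (~25 ms),
    mid-term segments (~250 ms), and long-term windows (~1 s)."""
    def frame(x, size, hop):
        n = max(1, (len(x) - size) // hop + 1)
        return np.stack([x[i * hop : i * hop + size] for i in range(n)])

    frames = frame(signal, int(0.025 * sr), int(0.010 * sr))  # short-term frame unit
    segs   = frame(signal, int(0.250 * sr), int(0.100 * sr))  # mid-term segment unit
    wins   = frame(signal, int(1.000 * sr), int(0.500 * sr))  # long-term window unit

    # Placeholder per-unit energy statistics standing in for real descriptors.
    f_short = frames.std(axis=1, keepdims=True)
    f_mid   = np.stack([segs.mean(axis=1), segs.std(axis=1)], axis=1)
    f_long  = np.stack([wins.mean(axis=1), wins.std(axis=1), wins.max(axis=1)], axis=1)
    return f_short, f_mid, f_long

class CIRNNSketch:
    """Recurrent net with an explicit memory (context) layer.

    The memory layer stores the previous hidden state so that earlier
    emotional context influences the current prediction; all sizes and
    the random initialization are arbitrary assumptions."""
    def __init__(self, n_in, n_hidden=32, n_out=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in  = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W_mem = rng.normal(0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))

    def forward(self, xs):
        h = np.zeros(self.W_mem.shape[0])          # memory layer state
        for x in xs:                               # step through time units
            h = np.tanh(self.W_in @ x + self.W_mem @ h)
        return self.W_out @ h                      # Activation, Dominance, Valence

sr = 16000
signal = np.random.randn(3 * sr)                   # 3 s of synthetic audio
f_s, f_m, f_l = multi_granularity_features(signal, sr)

# Naive fusion: repeat the coarser mid- and long-term features up to the
# frame rate and concatenate, so every time step carries all three granularities.
reps_m = int(np.ceil(len(f_s) / len(f_m)))
reps_l = int(np.ceil(len(f_s) / len(f_l)))
fused = np.hstack([f_s,
                   np.repeat(f_m, reps_m, axis=0)[:len(f_s)],
                   np.repeat(f_l, reps_l, axis=0)[:len(f_s)]])

net = CIRNNSketch(n_in=fused.shape[1])
vad = net.forward(fused)                           # three continuous outputs
```

The repeat-and-concatenate fusion above is only one simple way to let every time unit participate in training; the key property it illustrates is that frame-level detail and global statistics reach the recurrent memory layer together, rather than the global statistics replacing the temporal sequence.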