Traditional spectral features such as MFCC are obtained by further processing of the spectrogram. They have two drawbacks: framing causes the correlation between the spectral features of adjacent frames to be ignored, and the extracted features are not correlated with the target labels. As a result, much of the useful information in the spectrogram is lost. To address this, an algorithm for obtaining Deep Spectral Features (DSF) is proposed. Spectral features extracted directly from the spectrogram are used to train a Deep Belief Network (DBN), and bottleneck (BN) features are then taken from the bottleneck layer, which has relatively few hidden nodes. To overcome the first drawback, DSF features are built from feature parameters extracted from multiple adjacent frames of the speech signal. To overcome the second, the DBN's strong self-learning ability and its close coupling to the target labels are exploited, so that the fine-tuned DSF features are correlated with those labels. Extensive simulation experiments show that, in speech emotion recognition, fine-tuned DSF features achieve a recognition rate 3.97% higher than traditional MFCC features.
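The two structural ideas in the abstract, splicing adjacent frames into one input vector and reading features from a narrow bottleneck layer, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the layer sizes, the context width, and the edge-padding rule are all assumptions made for illustration, and the weights are random stand-ins rather than the result of DBN pre-training and fine-tuning.

```python
import math
import random


def splice_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Edge frames are padded by repeating the first or last frame
    (an assumed padding rule, not taken from the paper).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            idx = min(max(k, 0), n - 1)  # clamp the index to the valid range
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vec, weights, bias):
    """One fully connected sigmoid layer; `weights` holds one row per output unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def bottleneck_features(vec, layers, bottleneck_index):
    """Run the forward pass up to and including the bottleneck layer."""
    h = vec
    for weights, bias in layers[: bottleneck_index + 1]:
        h = dense(h, weights, bias)
    return h


def random_layer(n_in, n_out, rng):
    """Random stand-in for a trained layer (no RBM pre-training is performed here)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_bins, context = 8, 2  # hypothetical sizes chosen for the example
frames = [[rng.random() for _ in range(n_bins)] for _ in range(10)]  # stand-in spectrogram
spliced = splice_frames(frames, context)

in_dim = n_bins * (2 * context + 1)
# Hypothetical 40-16-4-16 stack; the 4-unit layer (index 1) plays the bottleneck role.
layers = [random_layer(in_dim, 16, rng),
          random_layer(16, 4, rng),
          random_layer(4, 16, rng)]
dsf = [bottleneck_features(v, layers, bottleneck_index=1) for v in spliced]

print(len(spliced[0]), len(dsf), len(dsf[0]))  # prints: 40 10 4
```

In this sketch each 8-bin frame becomes a 40-dimensional spliced vector, which is then reduced to a 4-dimensional bottleneck activation. In the scheme the abstract describes, the network would first be trained as a DBN and fine-tuned against the emotion labels before the bottleneck activations are used as features.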