Entropy (Dec 2022)

Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

  • Huawei Tao,
  • Shuai Shan,
  • Ziyi Hu,
  • Chunhua Zhu,
  • Hongyi Ge

DOI
https://doi.org/10.3390/e25010068
Journal volume & issue
Vol. 25, no. 1
p. 68

Abstract

Read online

The absence of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address sample sparsity. However, there is a lack of research on data augmentation algorithms in the field of SER. In this paper, the effectiveness of classical acoustic data augmentation methods in SER is analyzed, based on which a strong generalized speech emotion recognition model based on effective data augmentation is proposed. The model uses a multi-channel feature extractor consisting of multiple sub-networks to extract emotional representations. Different kinds of augmented data that can effectively improve SER performance are fed into the sub-networks, and the emotional representations are obtained by the weighted fusion of the output feature maps of each sub-network. And in order to make the model robust to unseen speakers, we employ adversarial training to generalize emotion representations. A discriminator is used to estimate the Wasserstein distance between the feature distributions of different speakers and to force the feature extractor to learn the speaker-invariant emotional representations by adversarial training. The simulation experimental results on the IEMOCAP corpus show that the performance of the proposed method is 2–9% ahead of the related SER algorithm, which proves the effectiveness of the proposed method.

Keywords