Taiyuan Ligong Daxue Xuebao (Journal of Taiyuan University of Technology), Sep 2021

Feature Fusion Based on Main-Auxiliary Network for Speech Emotion Recognition

  • Desheng HU,
  • Xueying ZHANG,
  • Jing ZHANG,
  • Baoyun LI

DOI: https://doi.org/10.16355/j.cnki.issn1007-9432tyut.2021.05.011
Journal volume & issue: Vol. 52, no. 5, pp. 769–774

Abstract


Speech emotion recognition is an important research direction in human-computer interaction, and effective feature extraction and fusion are among the key factors in improving recognition accuracy. This paper proposes a speech emotion recognition algorithm that uses a main-auxiliary network for deep feature fusion. First, segment-level features are fed into a BLSTM-attention network serving as the main network; the attention mechanism focuses on the emotion-relevant information in the speech signal. Then, Mel-spectrogram features are fed into a convolutional neural network with global average pooling (GAP) serving as the auxiliary network; GAP reduces the overfitting introduced by fully connected layers. Finally, the two branches are combined in a main-auxiliary structure to address the unsatisfactory recognition results caused by directly fusing different types of features. Comparative experiments with four models on the IEMOCAP dataset show that the weighted accuracy (WA) and unweighted accuracy (UA) obtained with the main-auxiliary deep feature fusion improve to varying degrees.
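The fusion pipeline described above can be sketched numerically: attention pooling over the main branch's recurrent outputs, global average pooling over the auxiliary branch's convolutional feature maps, and concatenation of the two deep features. This is a minimal numpy illustration of the pooling and fusion operations only; the tensor shapes, the random stand-ins for BLSTM and CNN outputs, and the attention vector `w` are all assumptions for demonstration, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Main branch: attention pooling over (stand-in) BLSTM hidden states.
T, d = 20, 8                    # frames and hidden size (illustrative)
H = rng.normal(size=(T, d))     # stand-in for BLSTM outputs, one row per frame
w = rng.normal(size=d)          # stand-in for the learned attention vector
alpha = softmax(H @ w)          # per-frame attention weights, sum to 1
main_feat = alpha @ H           # attention-weighted sum of frames -> shape (d,)

# Auxiliary branch: global average pooling over (stand-in) CNN feature maps.
C, F, Tm = 4, 6, 10             # channels, frequency bins, time steps (illustrative)
fmap = rng.normal(size=(C, F, Tm))   # stand-in for conv feature maps of a Mel spectrogram
aux_feat = fmap.mean(axis=(1, 2))    # GAP: one scalar per channel -> shape (C,)

# Fusion: concatenate the deep features from both branches before classification.
fused = np.concatenate([main_feat, aux_feat])   # shape (d + C,)
print(fused.shape)
```

GAP replaces the flattening + fully connected stage that would otherwise follow the convolutions, which is why it reduces the parameter count and hence the overfitting the abstract mentions.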

Keywords