Journal of King Saud University: Computer and Information Sciences (Jun 2024)

A high-speed inference architecture for multimodal emotion recognition based on a sparse cross-modal encoder

  • Lin Cui,
  • Yuanbang Zhang,
  • Yingkai Cui,
  • Boyan Wang,
  • Xiaodong Sun

Journal volume & issue
Vol. 36, no. 5
p. 102092

Abstract


In recent years, multimodal emotion recognition models have increasingly relied on pre-trained networks and attention mechanisms to pursue higher accuracy, which increases the training burden and slows both training and inference. To strike a balance between speed and accuracy, this paper proposes a speed-optimized multimodal architecture for speech and text emotion recognition. In the feature extraction stage, a lightweight residual graph convolutional network (ResGCN) serves as the speech feature extractor, and an efficient RoBERTa pre-trained network serves as the text feature extractor. A sparse cross-modal encoder (SCME) with optimized algorithmic complexity is then proposed to fuse these two types of features. Finally, a new gated fusion module (GF) weights the resulting representations and feeds them into a fully connected layer (FC) for classification. The proposed method is evaluated on the IEMOCAP and MELD datasets, achieving weighted accuracies (WA) of 82.4% and 65.0%, respectively. It attains higher accuracy than the compared methods while maintaining acceptable training and inference speed.
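The abstract describes the final fusion stage (GF followed by an FC classifier) only at a high level. As a rough illustration, the PyTorch sketch below shows one common sigmoid-gated formulation for fusing a speech feature and a text feature; the class name, dimensions, and gating equation are assumptions for illustration, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal sketch of a gated fusion (GF) head with an FC classifier.

    Hypothetical: the exact GF equations are not given in the abstract,
    so this uses a standard sigmoid-gate mix of the two modality features.
    """
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)    # learns per-dimension mixing weights
        self.fc = nn.Linear(dim, num_classes)  # final emotion classifier

    def forward(self, h_speech: torch.Tensor, h_text: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides how much each modality contributes per feature
        g = torch.sigmoid(self.gate(torch.cat([h_speech, h_text], dim=-1)))
        fused = g * h_speech + (1.0 - g) * h_text
        return self.fc(fused)  # logits over emotion classes

# Usage: a batch of 8 utterances with 256-dim features from each modality
head = GatedFusion(dim=256, num_classes=4)  # e.g. 4 emotion classes on IEMOCAP
logits = head(torch.randn(8, 256), torch.randn(8, 256))
```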

Keywords