Frontiers in Neurorobotics (May 2023)
Multimodal transformer augmented fusion for speech emotion recognition
Abstract
Speech emotion recognition is challenging due to the subjectivity and ambiguity of emotion. In recent years, multimodal methods for speech emotion recognition have achieved promising results. However, because data from different modalities are heterogeneous, effectively integrating information across modalities remains a key challenge of this research. Moreover, owing to the limitations of feature-level and decision-level fusion methods, capturing fine-grained modal interactions has often been neglected in previous studies. We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion strategy, combining feature-level fusion and model-level fusion, to perform fine-grained information interaction within and between modalities. A model-fusion module composed of three Cross-Transformer Encoders is proposed to generate multimodal emotional representations for modal guidance and information fusion. Specifically, the multimodal features obtained by feature-level fusion and the text features are used to enhance the speech features. Our proposed method outperforms existing state-of-the-art approaches on the IEMOCAP and MELD datasets.
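To make the cross-modal interaction described above concrete, the following is a minimal sketch (PyTorch assumed) of a cross-transformer encoder in which one modality's features serve as queries and a guiding modality's features serve as keys and values, plus a hypothetical model-fusion wrapper where speech is enhanced by text and by the feature-level-fused representation. All class names, dimensions, and the specific way the three encoders are combined are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CrossTransformerEncoder(nn.Module):
    """One cross-attention block: the target modality attends to a source modality."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, target, source):
        # target: (batch, T_tgt, d_model), e.g. speech features
        # source: (batch, T_src, d_model), e.g. text or fused multimodal features
        attn_out, _ = self.cross_attn(query=target, key=source, value=source)
        x = self.norm1(target + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x


class ModelFusionSketch(nn.Module):
    """Hypothetical model-level fusion: speech features are enhanced by text features
    and by the feature-level-fused representation, then the views are combined."""

    def __init__(self, d_model=256, n_classes=4):
        super().__init__()
        self.speech_from_text = CrossTransformerEncoder(d_model)
        self.speech_from_fused = CrossTransformerEncoder(d_model)
        self.text_from_speech = CrossTransformerEncoder(d_model)
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, speech, text, fused):
        s_t = self.speech_from_text(speech, text)    # speech guided by text
        s_f = self.speech_from_fused(speech, fused)  # speech guided by fused features
        t_s = self.text_from_speech(text, speech)    # text guided by speech
        # Pool each interaction view over time and concatenate for classification.
        pooled = torch.cat([s_t.mean(dim=1), s_f.mean(dim=1), t_s.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

As a usage note under these assumptions, `speech`, `text`, and `fused` would be sequence-level features already projected to a common dimension (here 256); the sketch simply illustrates how attention lets one modality guide another before a joint prediction is made.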
Keywords