Applied Sciences (Feb 2023)

Bimodal Fusion Network with Multi-Head Attention for Multimodal Sentiment Analysis

  • Rui Zhang,
  • Chengrong Xue,
  • Qingfu Qi,
  • Liyuan Lin,
  • Jing Zhang,
  • Lun Zhang

DOI
https://doi.org/10.3390/app13031915
Journal volume & issue
Vol. 13, no. 3
p. 1915

Abstract


The enrichment of social media expression has made multimodal sentiment analysis a research hotspot. However, modality heterogeneity poses great difficulties for effective cross-modal fusion, especially the modality alignment problem and the uncontrolled vector offset that arises during fusion. In this paper, we propose a bimodal multi-head attention network (BMAN) based on text and audio, which adaptively captures intramodal utterance features and complex intermodal alignment relationships. Specifically, we first use two independent unimodal encoders to extract the semantic features within each modality. Considering that different modalities deserve different weights, we then build a joint decoder that fuses the audio information into the text representation based on learnable weights, avoiding an unreasonable vector offset. The resulting cross-modal representation is used to improve sentiment prediction performance. Experiments on both the aligned and unaligned CMU-MOSEI datasets show that our model outperforms multiple baselines and is particularly effective at solving the cross-modal alignment problem.
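To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' code) of the bimodal fusion idea in the abstract: independent text and audio encoders, a cross-modal multi-head attention "joint decoder" that fuses audio into the text representation through a learnable weight, and a sentiment regression head. All dimensions, module choices, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BimodalFusionSketch(nn.Module):
    """Hypothetical sketch of a text-audio fusion network with multi-head attention."""

    def __init__(self, text_dim=768, audio_dim=74, d_model=128, n_heads=4):
        super().__init__()
        # Unimodal encoders: project each modality and model intra-modal context.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.audio_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Joint decoder: text queries attend over audio keys/values, so audio
        # information is aligned to and fused into the text stream.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable fusion weight, squashed to (0, 1), keeping the fused vector
        # close to the text representation (limits an unreasonable vector offset).
        self.gate = nn.Parameter(torch.zeros(1))
        self.head = nn.Linear(d_model, 1)  # sentiment score regression

    def forward(self, text_seq, audio_seq):
        t = self.text_enc(self.text_proj(text_seq))     # (B, Lt, d_model)
        a = self.audio_enc(self.audio_proj(audio_seq))  # (B, La, d_model)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        w = torch.sigmoid(self.gate)
        rep = (1 - w) * t + w * fused                    # weighted residual fusion
        return self.head(rep.mean(dim=1))                # pooled sentiment prediction


if __name__ == "__main__":
    model = BimodalFusionSketch()
    text = torch.randn(2, 50, 768)    # e.g. token-level text features
    audio = torch.randn(2, 120, 74)   # e.g. frame-level acoustic features
    print(model(text, audio).shape)   # torch.Size([2, 1])
```

Because the text queries attend over audio frames of a different length, this style of cross-modal attention does not require pre-aligned sequences, which is consistent with the abstract's claim about handling the unaligned CMU-MOSEI setting.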

Keywords