IEEE Access (Jan 2022)

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

  • Qingfu Qi
  • Liyuan Lin
  • Rui Zhang
  • Chengrong Xue

DOI: https://doi.org/10.1109/ACCESS.2022.3157712
Journal volume & issue: Vol. 10, pp. 28750–28759

Abstract

Multimodal sentiment analysis is a challenging task in the field of natural language processing (NLP). It uses the multimodal signals in a video (natural language, facial gestures, and acoustic behavior) to derive an understanding of emotion. However, the importance of any single modality to the emotional outcome is not static: as the time dimension extends, the emotional attributes of a given natural-language utterance are affected by the accompanying non-language data, producing a vector shift in the feature space. At the same time, both the long-term dependencies within a single modality and the long-term dependencies between multiple modalities that are "unaligned" in time must be considered. To address these problems, this paper proposes a Multimodal Encoding-Decoding network with Transformer (MEDT). The model encodes the multimodal data with a Bidirectional Encoder Representations from Transformers (BERT) network and a Transformer encoder to resolve the long-term dependencies within each modality, and it reconstructs the Transformer decoder to weight the multimodal data iteratively. The network thus accounts for both the long-term dependencies between modalities and the shifting effect that non-language data exerts on natural-language data. Under identical experimental conditions, we validated the model on standard multimodal sentiment analysis datasets; compared with state-of-the-art models, the network achieves solid improvements and strong stability.
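The abstract describes the architecture only at a high level. As a rough illustration of the encode-then-decode fusion pattern it outlines, here is a minimal PyTorch sketch, assuming stock nn.Transformer components: plain Transformer encoders stand in for the paper's BERT language encoder, and a standard Transformer decoder stands in for the authors' reconstructed, iteratively weighting decoder. All names and hyperparameters (MEDTSketch, d_model=128, and so on) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


def make_encoder(d_model: int, nhead: int, num_layers: int) -> nn.TransformerEncoder:
    # Per-modality encoder; a stand-in for BERT on the language stream.
    layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers)


class MEDTSketch(nn.Module):
    # Hypothetical reconstruction from the abstract alone: encode each
    # modality separately to resolve intra-modal long-term dependencies,
    # then let the language features attend (via a Transformer decoder)
    # to the concatenated acoustic/visual features, so that non-language
    # data can shift the language representation in feature space.
    def __init__(self, d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.text_enc = make_encoder(d_model, nhead, num_layers)
        self.audio_enc = make_encoder(d_model, nhead, num_layers)
        self.vision_enc = make_encoder(d_model, nhead, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.head = nn.Linear(d_model, 1)  # scalar sentiment score

    def forward(self, text, audio, vision):
        # Inputs: (batch, seq_len, d_model). The three sequence lengths
        # may differ (the streams can be unaligned) because
        # cross-attention needs no time-step correspondence.
        t = self.text_enc(text)
        context = torch.cat([self.audio_enc(audio), self.vision_enc(vision)], dim=1)
        fused = self.decoder(tgt=t, memory=context)  # language queries non-language context
        return self.head(fused.mean(dim=1))  # mean-pool over time, then predict


model = MEDTSketch()
score = model(torch.randn(2, 20, 128),  # language: 20 steps
              torch.randn(2, 50, 128),  # audio: 50 steps (unaligned)
              torch.randn(2, 30, 128))  # vision: 30 steps (unaligned)
print(score.shape)  # torch.Size([2, 1])
```

Treating the language stream as the decoder query and the other modalities as its memory mirrors the abstract's emphasis on how non-language data offsets the language representation; the paper's actual decoder reconstruction and iterative weighting scheme may differ from this sketch.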

Keywords