Jisuanji kexue yu tansuo (Nov 2024)

Multi-channel Temporal Convolution Fusion for Multimodal Sentiment Analysis

  • SUN Jie, CHE Wengang, GAO Shengxiang

DOI
https://doi.org/10.3778/j.issn.1673-9418.2309071
Journal volume & issue
Vol. 18, no. 11
pp. 3041–3050

Abstract

Multimodal sentiment analysis has become a hot research direction in affective computing, extending unimodal analysis to multimodal settings through information fusion. Word-level representation fusion is a key technique for modeling cross-modal interactions by capturing the interplay between elements of different modalities. It faces two main challenges: modeling local interactions between modal elements and modeling global interactions along the temporal dimension. When modeling local interactions, existing methods often adopt attention mechanisms to capture correlations between the overall features of each modality, ignoring interactions between adjacent elements and local features, and they are computationally expensive. To address these issues, a multi-channel temporal convolution fusion (MCTCF) model is proposed, which uses 2D convolutions to obtain local interactions between modal elements. Specifically, local connections capture associations between neighboring elements, multi-channel convolutions learn to fuse local features across modalities, and weight sharing greatly reduces computation. On the locally fused sequences, an LSTM network further models global correlations along the temporal dimension. Extensive experiments on the MOSI and MOSEI datasets demonstrate the efficacy and efficiency of MCTCF. Using just one convolution kernel (three channels, 28 weight parameters), it achieves state-of-the-art or competitive results on many metrics. Ablation studies confirm that both local convolution fusion and global temporal modeling are crucial to this performance. In summary, this paper enhances word-level representation fusion through feature interactions while reducing computational complexity.
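
To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' code) of how a single three-channel 2D convolution could fuse word-aligned text, audio, and vision features before an LSTM models the temporal dimension. The kernel size of 3, the shared feature dimension, the final regression head, and all names are assumptions for illustration; note that one 3x3 kernel over three input channels has 27 weights plus 1 bias, matching the 28 parameters quoted in the abstract.

import torch
import torch.nn as nn

class MCTCFSketch(nn.Module):
    """Hypothetical sketch of multi-channel temporal convolution fusion.

    Each modality (text, audio, vision), already projected to a common
    dimension d and aligned at the word level, is treated as one channel
    of a 2D map of shape (seq_len, d). A single multi-channel 2D
    convolution fuses neighboring elements and local features across
    modalities; an LSTM then models global temporal correlations.
    """

    def __init__(self, d_model: int = 64, hidden: int = 128, kernel: int = 3):
        super().__init__()
        # One kernel over 3 input channels: 3*3*3 weights + 1 bias = 28
        # parameters (the kernel size itself is an assumption here).
        self.local_fusion = nn.Conv2d(
            in_channels=3, out_channels=1,
            kernel_size=kernel, padding=kernel // 2,
        )
        self.temporal = nn.LSTM(d_model, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # assumed regression-style sentiment score

    def forward(self, text, audio, vision):
        # each modality: (batch, seq_len, d_model), word-aligned
        x = torch.stack([text, audio, vision], dim=1)   # (B, 3, T, d)
        fused = self.local_fusion(x).squeeze(1)         # (B, T, d)
        out, _ = self.temporal(fused)                   # (B, T, hidden)
        return self.head(out[:, -1])                    # (B, 1)

# Usage on toy inputs
if __name__ == "__main__":
    B, T, d = 4, 20, 64
    model = MCTCFSketch(d_model=d)
    t, a, v = (torch.randn(B, T, d) for _ in range(3))
    print(model(t, a, v).shape)  # torch.Size([4, 1])

The weight sharing mentioned in the abstract corresponds to the same small kernel being slid over every temporal position and feature index, which is what keeps the parameter count this low compared with attention over whole modality features.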

Keywords