Jisuanji kexue yu tansuo (Nov 2024)
Multi-channel Temporal Convolution Fusion for Multimodal Sentiment Analysis
Abstract
Multimodal sentiment analysis extends unimodal sentiment analysis to multimodal settings through information fusion and has become an active research direction in affective computing. Word-level representation fusion is a key technique for modeling cross-modal interactions, as it captures the interplay between elements of different modalities. It faces two main challenges: modeling local interactions between modal elements and modeling global interactions along the temporal dimension. When modeling local interactions, existing methods often adopt attention mechanisms that relate the overall features of different modalities; this ignores interactions between adjacent elements and local features and is computationally expensive. To address these issues, a multi-channel temporal convolution fusion (MCTCF) model is proposed, which uses 2D convolutions to capture local interactions between modal elements. Specifically, local connections capture associations between neighboring elements, multi-channel convolutions learn to fuse local features across modalities, and weight sharing greatly reduces computation. On the locally fused sequences, an LSTM network further models global correlations along the temporal dimension. Extensive experiments on the MOSI and MOSEI datasets demonstrate the efficacy and efficiency of MCTCF. Using just one convolution kernel (three channels, 28 weight parameters), it achieves state-of-the-art or competitive results on many metrics. Ablation studies confirm that both local convolution fusion and global temporal modeling are crucial to the superior performance. In summary, this work enhances word-level representation fusion through feature interactions while reducing computational complexity.
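To make the fusion idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation), assuming the three modalities are word-aligned to a common length and projected to a common feature dimension so they can be stacked as convolution channels; the class name, hidden size, and regression head are hypothetical illustrations.

```python
import torch
import torch.nn as nn

class MCTCFSketch(nn.Module):
    """Illustrative sketch of multi-channel temporal convolution fusion.

    Assumes text, audio, and visual streams are already word-aligned to
    length T and projected to a common feature dimension d, so they can be
    stacked as the 3 channels of a (batch, 3, T, d) tensor. Hyperparameters
    here are hypothetical, not taken from the paper.
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        # One 3x3 kernel over 3 modality channels:
        # 3*3*3 = 27 weights + 1 bias = 28 parameters, matching the
        # parameter count quoted in the abstract.
        self.fuse = nn.Conv2d(in_channels=3, out_channels=1,
                              kernel_size=3, padding=1)
        # Global temporal modeling over the locally fused word-level sequence.
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # sentiment score (regression)

    def forward(self, text, audio, visual):
        # Each input: (batch, T, d)
        x = torch.stack([text, audio, visual], dim=1)  # (batch, 3, T, d)
        fused = self.fuse(x).squeeze(1)                # (batch, T, d)
        _, (h_n, _) = self.lstm(fused)                 # final hidden state
        return self.head(h_n[-1])                      # (batch, 1)


# Usage sketch: batch of 8 utterances, 20 words each, feature dim 32.
model = MCTCFSketch(feat_dim=32)
t, a, v = (torch.randn(8, 20, 32) for _ in range(3))
print(model(t, a, v).shape)  # torch.Size([8, 1])
```

In this reading, local connections come from the 3x3 receptive field over neighboring words and features, cross-modal fusion comes from summing over the three modality channels, and weight sharing comes from sliding the same kernel across the whole sequence.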
Keywords