IEEE Access (Jan 2024)

VATMAN: Integrating Video-Audio-Text for Multimodal Abstractive SummarizatioN via Crossmodal Multi-Head Attention Fusion

  • Doosan Baek,
  • Jiho Kim,
  • Hongchul Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3447737
Journal volume & issue
Vol. 12
pp. 119174–119184

Abstract

The paper introduces VATMAN (Video-Audio-Text Multimodal Abstractive summarizatioN), a novel approach for generating hierarchical multimodal summaries using Trimodal Hierarchical Multi-head Attention. Unlike existing generative pre-trained language models, VATMAN employs a hierarchical attention mechanism that attends to the visual, audio, and text modalities. However, the existing literature lacks cross-modal attention at the block level. In light of this, we propose a block-level cross-modal attention mechanism, termed Blockwise Cross-modal Multi-head Attention (BCMA), to enhance summarization performance. This mechanism enables the model to capture context from the visual, audio, and text modalities simultaneously, providing a more comprehensive understanding of the input data. In terms of performance, VATMAN outperforms the state-of-the-art RNN-based trimodal model on the How2 dataset, achieving improvements of 7.53% in ROUGE-1 and 2.19% in ROUGE-L and demonstrating superior summarization quality. In addition, compared to uni-modal and di-modal baseline transformer models, VATMAN improves ROUGE-L by 11.12% and 3.85%, respectively, highlighting its effectiveness in capturing hierarchical relationships across modalities. Furthermore, we evaluated the generated abstractive summaries with additional metrics, including BLEU, METEOR, CIDEr, ContentF1, and BERTScore; the proposed model consistently outperformed the baselines on most of these metrics, demonstrating the quality of its generated summaries.
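The exact BCMA block is defined in the full paper; as a rough illustration of the general idea of cross-modal multi-head attention fusion over three modalities, the following is a minimal PyTorch sketch. The module structure, dimensions, and fusion order (text queries attending over video and audio keys/values, then concatenation and projection) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of trimodal cross-modal multi-head attention fusion.
# All module names, dimensions, and the fusion order are illustrative
# assumptions; they do not reproduce the exact BCMA block from the paper.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Text stream self-attends, then attends over video and audio streams;
    the three context streams are concatenated and projected back to d_model."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(3 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, text, video, audio):
        # Self-attention over the text stream (queries = keys = values = text).
        t, _ = self.text_self_attn(text, text, text)
        t = self.norm1(text + t)
        # Cross-modal attention: text queries, video/audio keys and values.
        tv, _ = self.text_video_attn(t, video, video)
        ta, _ = self.text_audio_attn(t, audio, audio)
        # Concatenate the three context streams and project back to d_model.
        fused = self.ffn(torch.cat([t, tv, ta], dim=-1))
        return self.norm2(t + fused)


if __name__ == "__main__":
    block = CrossModalFusionBlock()
    text = torch.randn(2, 50, 512)   # (batch, text tokens, d_model)
    video = torch.randn(2, 30, 512)  # (batch, video frames, d_model)
    audio = torch.randn(2, 40, 512)  # (batch, audio frames, d_model)
    out = block(text, video, audio)
    print(out.shape)  # torch.Size([2, 50, 512])
```

In this sketch the text stream serves as the query modality and the fused output keeps the text sequence length, which is one common way to combine heterogeneous streams for a text decoder; the paper's hierarchical, block-level design may differ.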

Keywords