IEEE Access (Jan 2024)

VATMAN: Integrating Video-Audio-Text for Multimodal Abstractive SummarizatioN via Crossmodal Multi-Head Attention Fusion

  • Doosan Baek,
  • Jiho Kim,
  • Hongchul Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3447737
Journal volume & issue
Vol. 12
pp. 119174–119184

Abstract

The paper introduces VATMAN (Video-Audio-Text Multimodal Abstractive summarizatioN), a novel approach for generating hierarchical multimodal summaries using Trimodal Hierarchical Multi-head Attention. Unlike existing generative pre-trained language models, VATMAN employs a hierarchical attention mechanism that attends to the visual, audio, and text modalities. However, the existing literature lacks cross-modal attention at the block level. In light of this, we propose a block-level cross-modal attention mechanism, termed Blockwise Cross-modal Multi-head Attention (BCMA), to enhance summarization performance. This mechanism enables the model to capture context from the visual, audio, and text modalities simultaneously, providing a more comprehensive understanding of the input data. In terms of performance, VATMAN outperforms the state-of-the-art RNN-based trimodal model on the How2 dataset, achieving improvements of 7.53% in ROUGE-1 and 2.19% in ROUGE-L and demonstrating superior summarization quality. In addition, compared to uni-modal and di-modal baseline transformer models, VATMAN improves ROUGE-L by 11.12% and 3.85%, respectively, highlighting its effectiveness in capturing hierarchical relationships across modalities. Furthermore, we evaluated the generated abstractive summaries with additional metrics, including BLEU, METEOR, CIDEr, ContentF1, and BERTScore; the proposed model consistently outperformed the baselines on most of these metrics, demonstrating the quality of its generated summaries.
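The exact BCMA block is defined in the full paper; as a rough illustration of the general idea of cross-modal multi-head attention fusion over three modalities, the following is a minimal PyTorch sketch. The module structure, dimensions, and fusion order (text queries attending over video and audio keys/values, then concatenation and projection) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of trimodal cross-modal multi-head attention fusion.
# All module names, dimensions, and the fusion order are illustrative
# assumptions; they do not reproduce the exact BCMA block from the paper.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Text stream self-attends, then attends over video and audio streams;
    the three context streams are concatenated and projected back to d_model."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(3 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, text, video, audio):
        # Self-attention over the text stream (queries = keys = values = text).
        t, _ = self.text_self_attn(text, text, text)
        t = self.norm1(text + t)
        # Cross-modal attention: text queries, video/audio keys and values.
        tv, _ = self.text_video_attn(t, video, video)
        ta, _ = self.text_audio_attn(t, audio, audio)
        # Concatenate the three context streams and project back to d_model.
        fused = self.ffn(torch.cat([t, tv, ta], dim=-1))
        return self.norm2(t + fused)


if __name__ == "__main__":
    block = CrossModalFusionBlock()
    text = torch.randn(2, 50, 512)   # (batch, text tokens, d_model)
    video = torch.randn(2, 30, 512)  # (batch, video frames, d_model)
    audio = torch.randn(2, 40, 512)  # (batch, audio frames, d_model)
    out = block(text, video, audio)
    print(out.shape)  # torch.Size([2, 50, 512])
```

In this sketch the text stream serves as the query modality and the fused output keeps the text sequence length, which is one common way to combine heterogeneous streams for a text decoder; the paper's hierarchical, block-level design may differ.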

Keywords