IET Computer Vision (Jun 2023)

MCR: Multilayer cross‐fusion with reconstructor for multimodal abstractive summarisation

  • Jingshu Yuan,
  • Jing Yun,
  • Bofei Zheng,
  • Lei Jiao,
  • Limin Liu

DOI
https://doi.org/10.1049/cvi2.12173
Journal volume & issue
Vol. 17, no. 4
pp. 389 – 403

Abstract

Multimodal abstractive summarisation (MAS) aims to generate a textual summary from a multimodal data collection, such as video‐text pairs. Despite the success of recent work, existing methods lack a thorough analysis of consistency across multimodal data. Moreover, previous work relies on fusion methods to extract multimodal semantics, neglecting constraints on the complementary semantics of each modality. To address these issues, a multilayer cross‐fusion model with a reconstructor (MCR) for the MAS task is proposed. The model thoroughly cross‐fuses the modalities through layers of cross‐modal transformer blocks, producing cross‐modal fusion representations that are consistent across modalities. A reconstructor is then employed to reproduce the source modalities from the cross‐modal fusion representations; this reconstruction process constrains the fusion representations to retain the complementary semantics of each modality. Comprehensive comparison and ablation experiments are conducted on the open‐domain multimodal dataset How2. The results empirically verify the effectiveness of the multilayer cross‐fusion and reconstructor structure in the proposed model.
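To make the two ideas in the abstract concrete, the sketch below shows (1) stacked cross‐modal transformer blocks in which text and video representations attend to each other, and (2) a reconstructor that reproduces each source modality from the fused representation, contributing a reconstruction loss as a complementary‐semantics constraint. This is not the authors' implementation; all module names, dimensions, and the use of a simple MSE reconstruction objective are assumptions for illustration only.

```python
# Minimal PyTorch sketch of multilayer cross-fusion with a reconstructor.
# All hyperparameters and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One cross-fusion layer: each modality attends to the other."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.txt_attends_vid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vid_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, txt, vid):
        t, _ = self.txt_attends_vid(txt, vid, vid)   # text queries video
        v, _ = self.vid_attends_txt(vid, txt, txt)   # video queries text
        return self.norm_t(txt + t), self.norm_v(vid + v)


class MCRSketch(nn.Module):
    """Multilayer cross-fusion followed by per-modality reconstructors."""

    def __init__(self, d_model: int = 512, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([CrossModalBlock(d_model) for _ in range(n_layers)])
        # Reconstructors map the fused representations back toward each source modality.
        self.recon_txt = nn.Linear(d_model, d_model)
        self.recon_vid = nn.Linear(d_model, d_model)

    def forward(self, txt, vid):
        fused_t, fused_v = txt, vid
        for layer in self.layers:
            fused_t, fused_v = layer(fused_t, fused_v)
        # Reconstruction loss constrains the fusion to keep each modality's semantics.
        recon_loss = (
            nn.functional.mse_loss(self.recon_txt(fused_t), txt)
            + nn.functional.mse_loss(self.recon_vid(fused_v), vid)
        )
        return fused_t, fused_v, recon_loss


# Toy usage: a batch of 2 samples with 20 text tokens and 30 video frames.
model = MCRSketch()
txt_feats = torch.randn(2, 20, 512)
vid_feats = torch.randn(2, 30, 512)
fused_t, fused_v, recon_loss = model(txt_feats, vid_feats)
print(fused_t.shape, fused_v.shape, recon_loss.item())
```

In a full summarisation pipeline, the fused representations would feed a text decoder, and the reconstruction loss would be added to the summarisation loss with a weighting coefficient (not shown here).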

Keywords