IEEE Access (Jan 2024)

MMSFT: Multilingual Multimodal Summarization by Fine-Tuning Transformers

  • Siginamsetty Phani
  • Ashu Abdul
  • M. Krishna Siva Prasad
  • Hiren Kumar Deva Sarma

DOI
https://doi.org/10.1109/ACCESS.2024.3454382
Journal volume & issue
Vol. 12
pp. 129673–129689

Abstract

Multilingual multimodal (MM) summarization, in which a single model processes multimodal input (MI) data across multiple languages to generate corresponding multimodal summaries (MS), remains underexplored. MI data consist of text and associated images, while an MS combines text with relevant images aligned with the MI context. In this paper, we propose an MM summarization model based on fine-tuning transformers (MMSFT), focusing on low-resource languages (LRLs) such as Indian languages. MMSFT comprises multilingual learning for encoder training, incorporating multilingual attention with a forget-gate mechanism, followed by MS generation using a decoder. In the proposed approach, we use the publicly available Multilingual Multimodal Summarization (M3LS) dataset. Evaluation using ROUGE metrics and the language-agnostic target summary metric (LaTM) shows that MMSFT substantially outperforms existing MM summarization models such as mT5 and VG-mT5. Furthermore, MMSFT yields better or equivalent summaries compared with existing MM summarization models trained separately for each language. Human and statistical evaluations confirm MMSFT's significant improvement over existing models, with a p-value $\leq 0.05$ in paired t-tests.
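The abstract mentions multilingual attention combined with a forget-gate mechanism. As a rough illustration only (the paper's actual architecture and parameters are not given here, and the gate parameters `gate_weight` and `gate_bias` below are hypothetical), one way such a gate can modulate attention output is to compute standard scaled dot-product attention and then scale the attended context by a sigmoid gate, analogous to an LSTM forget gate:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention(query, keys, values, gate_weight, gate_bias):
    """Scaled dot-product attention followed by a sigmoid forget gate.

    gate_weight and gate_bias are hypothetical learned parameters; the
    gate decides how much of the attended context to retain.
    """
    d = len(query)
    # Attention scores: query . key / sqrt(d), then softmax.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    # Forget gate: a scalar in (0, 1) computed from the query,
    # applied elementwise to the context.
    g = sigmoid(sum(w * q for w, q in zip(gate_weight, query)) + gate_bias)
    return [g * c for c in context]
```

With a zero gate weight and bias the gate evaluates to 0.5, so the context is simply halved; in training, the gate would instead learn when to suppress cross-lingual context.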

Keywords