Mathematics (Jun 2025)

Noise Improves Multimodal Machine Translation: Rethinking the Role of Visual Context

  • Xinyu Ma,
  • Jun Rao,
  • Xuebo Liu

DOI: https://doi.org/10.3390/math13111874
Journal volume & issue: Vol. 13, no. 11, p. 1874

Abstract

Multimodal Machine Translation (MMT) has long been assumed to outperform traditional text-only MT by leveraging visual information. However, recent studies challenge this assumption, showing that MMT models perform comparably even when tested without images or with mismatched images. This raises fundamental questions about the actual utility of visual information in MMT, which this work aims to investigate. We first revisit commonly used image-must and image-free MMT approaches, finding that their suboptimal performance may stem from insufficiently robust baseline models. To further examine the role of visual information, we propose a novel visual type regularization method and introduce two probing tasks, Visual Contribution Probing and Modality Relationship Probing, to analyze whether and how visual features influence a strong MMT model. Surprisingly, our findings on a mainstream dataset indicate that the gains from visual information are marginal. We attribute these gains primarily to a regularization effect, one that can be replicated with random noise. Our results suggest that the MMT community should critically re-evaluate baseline models, evaluation metrics, and dataset design to advance multimodal learning meaningfully.
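The abstract's central claim, that the benefit of images can be replicated with random noise, suggests a simple ablation. The following is a minimal, hypothetical PyTorch sketch (not the authors' code; ToyMMTEncoder, the gated-fusion design, and all dimensions are assumptions for illustration) of how an encoder's visual slot could be fed Gaussian noise in place of real image features to test whether translation quality changes.

```python
import torch
import torch.nn as nn


class ToyMMTEncoder(nn.Module):
    """Hypothetical sketch: fuse text states with a 'visual' vector that
    is either a real image feature or pure Gaussian noise."""

    def __init__(self, d_model: int = 512, d_visual: int = 2048):
        super().__init__()
        self.d_visual = d_visual
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_states: torch.Tensor,
                visual_feat: torch.Tensor | None = None) -> torch.Tensor:
        # text_states: (batch, seq_len, d_model)
        if visual_feat is None:
            # Noise stand-in for the image. If this matches real image
            # features downstream, the 'visual' gain is plausibly just a
            # regularization effect, as the abstract argues.
            visual_feat = torch.randn(text_states.size(0), self.d_visual,
                                      device=text_states.device,
                                      dtype=text_states.dtype)
        v = self.visual_proj(visual_feat).unsqueeze(1)   # (B, 1, d_model)
        v = v.expand(-1, text_states.size(1), -1)        # broadcast over seq
        # Scalar gate per position controls how much 'visual' signal enters.
        g = torch.sigmoid(self.gate(torch.cat([text_states, v], dim=-1)))
        return text_states + g * v                       # gated fusion
```

Training one model with real features and one with the noise branch, then comparing BLEU, would reproduce at a coarse level the kind of contribution probing the paper describes; the gating and fusion details here are illustrative only.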

Keywords