Electronics (Aug 2023)

Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning

  • Nan Xiang,
  • Ling Chen,
  • Leiyan Liang,
  • Xingdi Rao,
  • Zehao Gong

DOI
https://doi.org/10.3390/electronics12173549
Journal volume & issue
Vol. 12, no. 17
p. 3549

Abstract

Read online

Unsupervised image captioning often grapples with challenges such as image–text mismatches and modality gaps, resulting in suboptimal captions. This paper introduces a semantic-enhanced cross-modal fusion model (SCFM) to address these issues. The SCFM integrates three innovative components: a text semantic enhancement network (TSE-Net) for nuanced semantic representation; contrast learning for optimizing similarity measures between text and images; and enhanced visual selection decoding (EVSD) for precise captioning. Unlike existing methods that struggle with capturing accurate semantic relationships and flexibility across scenarios, the proposed model provides a robust solution for unbiased and diverse captioning. Through experimental evaluations on the MS COCO and Flickr30k datasets, SCFM demonstrates significant improvements over the benchmark model, enhancing the CIDEr and BLEU-4 metrics by 3.6% and 3.2%, respectively. Visualization analysis further reveals the model’s superiority in increasing variability between hidden features and its potential in cross-domain and stylized image captioning. The findings not only contribute to the advancement of image captioning techniques but also open avenues for future research. Further investigations will explore SCFM’s adaptability to other multimodal tasks and refine it for more intricate image–text relationships.

Keywords