Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning

Nan Xiang; Ling Chen; Leiyan Liang; Xingdi Rao; Zehao Gong

doi:10.3390/electronics12173549

Electronics (Aug 2023)

Affiliations

Nan Xiang: Liangjiang International College, Chongqing University of Technology, Chongqing 401135, China
Ling Chen: Liangjiang International College, Chongqing University of Technology, Chongqing 401135, China
Leiyan Liang: Liangjiang International College, Chongqing University of Technology, Chongqing 401135, China
Xingdi Rao: Liangjiang International College, Chongqing University of Technology, Chongqing 401135, China
Zehao Gong: Liangjiang International College, Chongqing University of Technology, Chongqing 401135, China

DOI: https://doi.org/10.3390/electronics12173549
Journal volume & issue: Vol. 12, no. 17, p. 3549

Abstract

Unsupervised image captioning often suffers from image–text mismatches and modality gaps, which lead to suboptimal captions. This paper introduces a semantic-enhanced cross-modal fusion model (SCFM) to address these issues. The SCFM integrates three components: a text semantic enhancement network (TSE-Net) for nuanced semantic representation; contrastive learning for optimizing similarity measures between text and images; and enhanced visual selection decoding (EVSD) for precise captioning. Unlike existing methods, which struggle to capture accurate semantic relationships and to remain flexible across scenarios, the proposed model provides a robust solution for unbiased and diverse captioning. In experimental evaluations on the MS COCO and Flickr30k datasets, SCFM demonstrates significant improvements over the benchmark model, improving the CIDEr and BLEU-4 metrics by 3.6% and 3.2%, respectively. Visualization analysis further reveals the model’s superiority in increasing variability between hidden features and its potential in cross-domain and stylized image captioning. The findings contribute to the advancement of image captioning techniques and open avenues for future research. Further investigations will explore SCFM’s adaptability to other multimodal tasks and refine it for more intricate image–text relationships.
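
As an illustration of the contrastive objective for text and image similarity mentioned in the abstract, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration, and the loss actually used in SCFM may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive.

    Hypothetical helper: the temperature value is an assumption,
    not a setting reported for SCFM.
    """
    n = len(image_embs)
    # Similarity matrix: rows index images, columns index texts.
    logits = [[cosine(img, txt) / temperature for txt in text_embs]
              for img in image_embs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings (assumed): aligned pairs versus swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is close to zero when each image embedding aligns with its own caption embedding and grows large when the pairs are swapped, which is the behaviour a similarity-optimizing objective of this kind relies on.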

Published in Electronics

ISSN: 2079-9292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: http://www.mdpi.com/journal/electronics
