Soft Contrastive Cross-Modal Retrieval

Jiayu Song; Yuxuan Hu; Lei Zhu; Chengyuan Zhang; Jian Zhang; Shichao Zhang

doi:10.3390/app14051944

Applied Sciences (Feb 2024)


Affiliations

Jiayu Song: School of Computer Science and Engineering, Central South University, Changsha 410083, China
Yuxuan Hu: School of Computer Science and Engineering, Central South University, Changsha 410083, China
Lei Zhu: College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
Chengyuan Zhang: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Jian Zhang: School of Computer Science and Engineering, Central South University, Changsha 410083, China
Shichao Zhang: School of Computer Science and Engineering, Central South University, Changsha 410083, China

Journal volume & issue: Vol. 14, no. 5, p. 1944

Abstract


Cross-modal retrieval, which aims to efficiently retrieve items of one modality given a query from another, plays a key role in natural language processing. Despite the notable achievements of existing cross-modal retrieval methods, the embedding space grows more complex as models grow more complex, yielding representations that are less interpretable and prone to overfitting. Moreover, most existing methods achieve their strong results on datasets free of errors and noise, an idealized setting that leaves the trained models lacking robustness. To address these problems, we propose a novel approach, Soft Contrastive Cross-Modal Retrieval (SCCMR), which integrates a deep cross-modal model with soft contrastive learning and smooth label cross-entropy learning to improve the common subspace embedding and the generalizability and robustness of the model. To evaluate the performance and effectiveness of SCCMR, we conduct extensive experiments on three multi-modal datasets, comparing against 12 state-of-the-art methods and using image–text retrieval as a showcase. The experimental results show that our proposed method outperforms the baselines.
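The abstract names smooth label cross-entropy learning but does not give its formula. As a rough illustration only, the sketch below assumes the standard label-smoothing formulation, in which the true class receives probability 1 - epsilon + epsilon / K and each of the K classes receives an extra epsilon / K. The function name `smooth_label_cross_entropy` and the default `epsilon=0.1` are hypothetical choices for this example and may differ from what SCCMR actually uses.

```python
import math


def smooth_label_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy between softmax(logits) and a smoothed one-hot target.

    Assumed formulation (not taken from the paper): the target class gets
    1 - epsilon + epsilon / K, every other class gets epsilon / K.
    """
    k = len(logits)

    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    # Build the smoothed target distribution; it still sums to 1.
    smooth = [epsilon / k] * k
    smooth[target] += 1.0 - epsilon

    # Cross-entropy: negative expected log-probability under the target.
    return -sum(q * lp for q, lp in zip(smooth, log_probs))


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]
    print(smooth_label_cross_entropy(confident, target=0, epsilon=0.0))
    print(smooth_label_cross_entropy(confident, target=0, epsilon=0.1))
```

With `epsilon=0.0` the function reduces to ordinary cross-entropy. With a positive `epsilon`, an overconfident correct prediction is penalized more, which is the usual motivation for label smoothing when training labels may be noisy.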

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
