IEEE Access (Jan 2024)
ClusterE-ZSL: A Novel Cluster-Based Embedding for Enhanced Zero-Shot Learning in Contrastive Pre-Training Cross-Modal Retrieval
Abstract
Zero-shot learning (ZSL) in a multimodal environment presents significant challenges and opportunities for improving cross-modal retrieval and object detection on unseen data. This study introduces a novel embedding approach based on vector-space clustering to address image-to-text and text-to-image retrieval problems effectively. We propose an iterative training strategy: unlike the CLIP model, which directly compares visual and textual modalities, our model clusters trained image and text features in a common vector space. We use cross-modal contrastive and multi-stage contrastive losses to improve the unsupervised learning of our model. This integration yields well-separated clusters in the embedding space, which improves image-text matching in zero-shot learning tasks. We rigorously evaluate our model's performance on standard benchmark datasets, including Flickr30K, Flickr8K, and MSCOCO 5K, achieving notable improvements with accuracies of 91.3%, 88.8%, and 90.3%, respectively. The results not only demonstrate that our model outperforms existing methods but also show its effectiveness in enhancing cross-modal retrieval in zero-shot learning.
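For readers unfamiliar with the cross-modal contrastive objective referenced above, the following is a minimal, generic CLIP-style sketch (not the paper's implementation; the symmetric InfoNCE formulation and the `temperature` value are standard assumptions), in which matched image-text pairs are pulled together in the shared embedding space and mismatched pairs are pushed apart:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over N matched (image, text) pairs.

    img_emb, txt_emb: arrays of shape (N, D); row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature      # (N, N) cosine-similarity matrix
    n = len(img)                            # the i-th image matches the i-th text

    def cross_entropy(l):
        # softmax over each row, then negative log-likelihood of the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(n), np.arange(n)]).mean()

    # average the image-to-text and text-to-image retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this loss, perfectly aligned modalities (identical embeddings for each pair) give a near-zero loss, while unrelated embeddings give a loss close to log N; clustering methods such as the one proposed here operate on top of the resulting shared space.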
Keywords