IEEE Access (Jan 2023)

Transformer-Based Discriminative and Strong Representation Deep Hashing for Cross-Modal Retrieval

  • Suqing Zhou,
  • Yu Han,
  • Ning Chen,
  • Siyu Huang,
  • Kostromitin Konstantin Igorevich,
  • Jia Luo,
  • Peiying Zhang

DOI
https://doi.org/10.1109/ACCESS.2023.3339581
Journal volume & issue
Vol. 11, pp. 140041–140055

Abstract

Cross-modal hashing retrieval has attracted extensive attention due to its low storage requirements and high retrieval efficiency. In particular, the key to improving its performance is to exploit the correlations among different modalities more fully and to generate more discriminative representations. Moreover, Transformer-based models have been widely adopted in fields such as natural language processing owing to their powerful ability to model contextual information. Motivated by these observations, we propose a Transformer-based Discriminative and Strong Representation Deep Hashing (TDSRDH). For the text modality, since the sequential relations between words imply semantic relations that are not independent, we encode the text with a Transformer-based encoder to obtain a strong representation. In addition, we propose a triple-supervised loss built on the commonly used pairwise loss and quantization loss. The pairwise and quantization losses ensure that the learned features and hash codes preserve the similarity of the original data during learning, while the third term pulls similar instances closer together and pushes dissimilar instances farther apart. TDSRDH can therefore generate more discriminative representations while preserving the similarity between modalities. Finally, experiments on the three datasets MIRFLICKR-25K, IAPR TC-12, and NUS-WIDE demonstrate the superiority of TDSRDH over the other baselines, and ablation experiments confirm the effectiveness of the proposed ideas.
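
To make the design described above concrete, the following is a minimal PyTorch sketch of a Transformer-based text encoder and the three supervision signals. All names, architectures, and hyperparameters (TextEncoder, vocab_size, margin, the alpha/beta/gamma weights, and the exact form of each loss term) are illustrative assumptions, not the authors' implementation; in particular, the margin-based triplet term stands in for the distance constraint the abstract describes.

```python
import torch
import torch.nn.functional as F

class TextEncoder(torch.nn.Module):
    """Sketch of a Transformer encoder over word sequences: contextual word
    features are mean-pooled and mapped to relaxed hash codes in (-1, 1)."""

    def __init__(self, vocab_size=10000, d_model=256, n_layers=2, hash_bits=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=n_layers)
        self.hash_head = torch.nn.Linear(d_model, hash_bits)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))        # contextual word features
        return torch.tanh(self.hash_head(h.mean(dim=1)))

def pairwise_loss(f_img, f_txt, S):
    """Likelihood-style pairwise loss: S[i, j] = 1 if image i and text j are
    similar, else 0; minimizing it preserves cross-modal similarity."""
    theta = 0.5 * (f_img @ f_txt.t())               # pairwise inner products
    return (F.softplus(theta) - S * theta).mean()

def quantization_loss(f):
    """Penalize the gap between continuous features and their binarization,
    so little similarity information is lost when codes are quantized."""
    return F.mse_loss(f, torch.sign(f).detach())

def triplet_term(anchor, positive, negative, margin=1.0):
    """Margin loss: the anchor is pulled toward a similar (positive) instance
    and pushed at least `margin` away from a dissimilar (negative) one."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

def triple_supervised_loss(f_img, f_txt, S, anc, pos, neg,
                           alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; alpha/beta/gamma are illustrative
    trade-off weights, not values reported in the paper."""
    return (alpha * triplet_term(anc, pos, neg)
            + beta * pairwise_loss(f_img, f_txt, S)
            + gamma * (quantization_loss(f_img) + quantization_loss(f_txt)))
```

Under these assumptions, a training step would minimize triple_supervised_loss over batches of image and text features, and retrieval-time hash codes would be obtained by binarizing the relaxed outputs with torch.sign.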

Keywords