Jisuanji kexue (Computer Science), Apr 2023

Text-Image Cross-modal Retrieval Based on Transformer

  • YANG Xiaoyu, LI Chao, CHEN Shunyao, LI Haoliang, YIN Guangqiang

DOI
https://doi.org/10.11896/jsjkx.220100083
Journal volume & issue
Vol. 50, no. 4
pp. 141–148

Abstract

With the growth of multimedia data on the Internet, text-image retrieval has become a research hotspot. In image-text retrieval, methods based on a mutual attention mechanism achieve strong matching results by letting image and text features interact. However, such methods cannot encode image features and text features independently: in large-scale retrieval, the image-text interaction must be computed at a late stage for every candidate pair, which consumes a great deal of time and prevents fast retrieval and matching. In contrast, Transformer-based cross-modal image-text feature learning has achieved good results and is attracting increasing attention from researchers. This paper designs a novel Transformer-based text-image retrieval network (HAS-Net) with the following main improvements: a hierarchical Transformer encoding structure is designed to better exploit low-level syntactic information and high-level semantic information; the traditional global feature aggregation method is improved by using the self-attention mechanism to design a new aggregation method; and, by sharing the Transformer encoding layers, image features and text features are mapped into a common feature coding space. Experiments on the MS-COCO and Flickr30k datasets show improved cross-modal retrieval performance that leads comparable algorithms, demonstrating that the designed network structure is effective.
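The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the two ideas it names: self-attention feature aggregation in place of traditional global pooling, and a shared Transformer encoding layer that maps both modalities into a common space. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' HAS-Net code.

```python
# Minimal sketch (not the authors' code): self-attention aggregation and a
# shared Transformer encoder for a common image-text space. All names and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class AttnAggregate(nn.Module):
    """Aggregate a token sequence into one vector with a learned query."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # CLS-like query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> aggregated (batch, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)

class SharedSpaceEncoder(nn.Module):
    """Shared Transformer layers encode each modality independently,
    so embeddings can be precomputed before retrieval."""
    def __init__(self, dim: int = 512, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=layers)
        self.pool = AttnAggregate(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        feats = self.shared(tokens)  # same weights for both modalities
        return nn.functional.normalize(self.pool(feats), dim=-1)

encoder = SharedSpaceEncoder()
img_tokens = torch.randn(4, 49, 512)  # e.g. image region/patch features
txt_tokens = torch.randn(4, 20, 512)  # e.g. word embeddings
img_vec, txt_vec = encoder(img_tokens), encoder(txt_tokens)
sim = img_vec @ txt_vec.T             # cosine similarities for retrieval
```

Because the two modalities share encoder weights but never attend to each other, image and text embeddings can be computed separately offline and compared with a single dot product at query time, which is the efficiency advantage over mutual-attention matching that the abstract argues for.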

Keywords