IET Computer Vision (Aug 2024)

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

  • Zheng Cui,
  • Yongli Hu,
  • Yanfeng Sun,
  • Baocai Yin

DOI
https://doi.org/10.1049/cvi2.12270
Journal volume & issue
Vol. 18, no. 5
pp. 652–665

Abstract

Image‐text retrieval is a fundamental yet challenging task that aims to bridge the semantic gap between heterogeneous data and achieve precise measurement of semantic similarity. Fine‐grained alignment between cross‐modal features plays a key role in many successful methods. Nevertheless, existing methods cannot effectively utilise intra‐modal information to enhance feature representations, and they lack powerful similarity reasoning for producing precise similarity scores. To tackle these issues, a context‐aware Relation Enhancement and Similarity Reasoning model, called RESR, is proposed, which conducts both intra‐modal relation enhancement and inter‐modal similarity reasoning while considering global‐context information. For intra‐modal relation enhancement, a novel context‐aware graph convolutional network is introduced to enhance local feature representations by utilising relation and global‐context information. For inter‐modal similarity reasoning, local and global similarity features are obtained through the bidirectional alignment of image and text, and similarity reasoning is performed over these multi‐granularity similarity features. Finally, the refined local and global similarity features are adaptively fused to produce a precise similarity score. Experimental results show that the model outperforms state‐of‐the‐art approaches, achieving average improvements of 2.5% and 6.3% in R@sum on the Flickr30K and MS‐COCO datasets.
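
The abstract names two technical ingredients: a context‐aware graph convolutional network that enhances local features using relation and global‐context information, and an adaptive fusion of local and global similarity scores. The PyTorch sketch below is only a minimal illustration of what such components might look like; the module names (ContextAwareGCN, fuse_similarities), the use of a mean‐pooled vector as the global context, the attention‐style relation graph, and the learned scalar fusion weight are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareGCN(nn.Module):
    """Hypothetical context-aware graph convolution (illustrative, not the
    paper's exact design): local features exchange messages over a fully
    connected relation graph and are gated by a global-context vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)          # relation "query" projection
        self.k = nn.Linear(dim, dim)          # relation "key" projection
        self.v = nn.Linear(dim, dim)          # message projection
        self.ctx_gate = nn.Linear(dim, dim)   # global-context gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, dim) local region or word features.
        scale = x.size(-1) ** 0.5
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / scale, dim=-1)
        rel = attn @ self.v(x)                # relation-aggregated messages
        # Gate the messages with a mean-pooled global-context vector (assumption).
        ctx = torch.sigmoid(self.ctx_gate(x.mean(dim=1, keepdim=True)))
        return F.relu(x + ctx * rel)          # residual enhancement

def fuse_similarities(local_sim: torch.Tensor,
                      global_sim: torch.Tensor,
                      alpha: nn.Parameter) -> torch.Tensor:
    """Adaptive fusion of local and global similarity scores via one learned
    weight -- a simple stand-in for the paper's fusion module."""
    w = torch.sigmoid(alpha)
    return w * local_sim + (1 - w) * global_sim

# Toy usage: 36 region features of dimension 1024 for a batch of 2 images.
feats = torch.randn(2, 36, 1024)
enhanced = ContextAwareGCN(1024)(feats)
alpha = nn.Parameter(torch.zeros(1))
score = fuse_similarities(torch.rand(2), torch.rand(2), alpha)
print(enhanced.shape, score.shape)  # torch.Size([2, 36, 1024]) torch.Size([2])
```

In this sketch the relation step is an attention‐weighted message pass and the context step is a sigmoid gate; the paper's actual graph construction and fusion strategy may differ.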

Keywords