IEEE Access (Jan 2020)

Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces

  • Rintaro Yanagi
  • Ren Togo
  • Takahiro Ogawa
  • Miki Haseyama

DOI
https://doi.org/10.1109/ACCESS.2020.2995815
Journal volume & issue
Vol. 8
pp. 96777–96786

Abstract


This paper proposes a new approach that drastically improves cross-modal retrieval performance between vision and language (hereinafter referred to as “vision and language retrieval”). Vision and language retrieval takes data of one modality as a query to retrieve relevant data of the other modality, enabling flexible retrieval across modalities. Most existing methods learn optimal embeddings of visual and textual information into a single common representation space. We argue, however, that forcing both modalities into one space discards key information from sentences and images. In this paper, we propose a simple but robust vision and language retrieval method that makes effective use of multiple representation spaces: the proposed method exploits individual, modality-specific representation spaces through text-to-image and image-to-text models. Experimental results show that the proposed approach enhances the performance of existing methods that embed visual and textual information into a single common representation space.
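To make the idea concrete, the sketch below illustrates the retrieval scheme the abstract describes: a text query is scored against candidate images both in a shared common space and in the image-specific feature space, which the query reaches via a text-to-image mapping. This is a minimal illustration, not the authors' implementation; the random projections stand in for learned encoders, and the names (W_t2i, retrieve_images) and the fusion weight alpha are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned models (random projections here):
D_TXT, D_IMG, D_COMMON = 300, 512, 256
W_txt = rng.standard_normal((D_COMMON, D_TXT))  # text  -> common space
W_img = rng.standard_normal((D_COMMON, D_IMG))  # image -> common space
W_t2i = rng.standard_normal((D_IMG, D_TXT))     # text  -> image-specific space


def cosine(query, candidates):
    """Cosine similarity between a query vector and each row of `candidates`."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q


def retrieve_images(text_feat, image_feats, alpha=0.5):
    """Rank candidate images for a text query by fusing two scores:
    similarity in the shared common space, and similarity in the
    image-specific space reached via the text-to-image mapping.
    The weight `alpha` is an assumption, not taken from the paper."""
    s_common = cosine(W_txt @ text_feat, image_feats @ W_img.T)
    s_specific = cosine(W_t2i @ text_feat, image_feats)
    return np.argsort(-(alpha * s_common + (1 - alpha) * s_specific))


# Toy usage: rank 5 random candidate images for one random text query.
images = rng.standard_normal((5, D_IMG))
query = rng.standard_normal(D_TXT)
print(retrieve_images(query, images))  # candidate indices, best first
```

The mirrored direction (an image query scored in the text-specific space through an image-to-text model) would follow the same pattern with the roles of the two modalities swapped.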

Keywords