IEEE Access (Jan 2024)

MVItem: A Benchmark for Multi-View Cross-Modal Item Retrieval

  • Bo Li,
  • Jiansheng Zhu,
  • Linlin Dai,
  • Hui Jing,
  • Zhizheng Huang,
  • Yuteng Sui

DOI
https://doi.org/10.1109/ACCESS.2024.3447872
Journal volume & issue
Vol. 12
pp. 119563–119576

Abstract

Existing text-image pre-training models have demonstrated strong generalization capabilities; however, their item-retrieval performance in real-world scenarios still falls short of expectations. To improve how well text-image pre-training models retrieve items in real scenarios, we present MVItem, a benchmark for exploring multi-view item retrieval built on the open-source MVImgNet dataset. First, we evenly sample each item in MVImgNet to obtain five images from different views and annotate these images automatically with MiniGPT-4. Then, through manual cleaning and comparison, we provide a high-quality textual description for each sample. Next, to investigate the spatial-misalignment problem in real-world item retrieval and mitigate its impact, we devise a multi-view feature-fusion strategy: a cosine-distance balancing method based on Sequential Least Squares Programming (SLSQP), named balancing cosine distance (BCD), that fuses multiple view vectors into one. On this basis, we select representative state-of-the-art text-image pre-training retrieval models as baselines and establish multiple test groups to explore how multi-view information eases potential spatial misalignment in item retrieval. Experimental results show that retrieval with fused multi-view features generally outperforms the baselines, indicating that multi-view feature fusion helps alleviate the impact of spatial misalignment. Moreover, the proposed fusion, balancing cosine distance (BCD), generally outperforms feature averaging, denoted balancing Euclidean distance (BED) in this work. Finally, we find that fusing many images with different views is more helpful for text-to-image (T2I) retrieval, while fusing a small number of images with large view differences is more helpful for image-to-image (I2I) retrieval.
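To make the fusion idea concrete, here is a minimal sketch of what a cosine-distance balancing step via SLSQP could look like. It assumes BCD means finding convex weights over the view embeddings so that the fused vector sits at equal cosine distance from every view; the variance objective, the function names, and the uniform-weight initialization (which corresponds to plain averaging, i.e., the BED baseline) are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of balancing cosine distance (BCD) fusion via SLSQP.
# Assumption: "balancing" = minimizing the spread of cosine distances
# between the fused vector and each of the K view embeddings.
import numpy as np
from scipy.optimize import minimize


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def fuse_bcd(views):
    """Fuse K view embeddings (K x D array) into a single vector whose
    cosine distances to all K views are balanced, solved with SLSQP."""
    K = views.shape[0]

    def objective(w):
        fused = w @ views  # weighted combination of the view vectors
        dists = np.array([cosine_distance(fused, v) for v in views])
        # Variance of the distances is zero when they are perfectly balanced.
        return np.var(dists)

    w0 = np.full(K, 1.0 / K)  # start from plain averaging (the BED baseline)
    res = minimize(
        objective,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x @ views


# Toy usage: five 512-d view embeddings, matching the benchmark's 5-view setup.
rng = np.random.default_rng(0)
views = rng.normal(size=(5, 512))
fused = fuse_bcd(views)
```

The fused vector can then stand in for a single-view embedding when scoring queries against the gallery in the T2I and I2I settings the abstract describes.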

Keywords