IEEE Access (Jan 2025)

Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook

  • Yang Qin,
  • Shuxue Ding,
  • Huiming Xie

DOI
https://doi.org/10.1109/access.2025.3541194
Journal volume & issue
Vol. 13
pp. 49922 – 49933

Abstract


Large-scale image and text representation learning is critical to the performance of multimodal tasks involving images and text, such as visual question answering and image captioning. Most existing research on large-scale image and text representation learning relies on Transformer networks for pre-training, i.e., learning generic semantic representations from large-scale image-text pairs. These representations are then fine-tuned and transferred to downstream multimodal tasks. This paper first provides a brief analysis of the advantages of pre-training models. It then comprehensively summarizes the relevant research on large-scale image and text representation learning based on pre-training, focusing on pre-training model architectures, pre-training tasks, and image-text datasets. Finally, we provide a summary and outlook on large-scale image and text representation learning.

Keywords