IET Image Processing (Feb 2022)

A thorough review of models, evaluation metrics, and datasets on image captioning

  • Gaifang Luo,
  • Lijun Cheng,
  • Chao Jing,
  • Can Zhao,
  • Guozhu Song

DOI
https://doi.org/10.1049/ipr2.12367
Journal volume & issue
Vol. 16, no. 2
pp. 311 – 332

Abstract

Image captioning is the task of automatically generating descriptive sentences from a query image. It has recently received widespread attention from the computer vision and natural language processing communities as an emerging visual task. Currently, both the visual and language components of captioning models have evolved considerably by exploiting object regions, attributes, attention mechanisms, recognition of novel entities, and training strategies. However, despite these impressive results, the research has not yet reached a conclusive answer. This survey aims to provide a comprehensive overview of image captioning methods, from technical architectures to benchmark datasets, evaluation metrics, and comparisons of state-of-the-art methods. In particular, image captioning methods are divided into different categories based on the technique adopted. Representative methods in each class are summarized, and their advantages and limitations are discussed. Moreover, many related state-of-the-art studies are quantitatively compared to determine the recent trends and future directions in image captioning. The ultimate goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions in the area of image captioning, from which the Computer Vision and Natural Language Processing communities may benefit.