Applied Sciences (Oct 2023)

A Review of Transformer-Based Approaches for Image Captioning

  • Oscar Ondeng
  • Heywood Ouma
  • Peter Akuon

DOI
https://doi.org/10.3390/app131911103
Journal volume & issue
Vol. 13, no. 19, p. 11103

Abstract

Visual understanding is a research area that bridges the gap between computer vision and natural language processing. Image captioning is a visual understanding task in which vision-language models automatically generate natural language descriptions of images. The transformer architecture was initially developed for natural language processing and quickly found application in computer vision. Its recent application to image captioning has resulted in markedly improved performance. In this paper, we briefly look at the transformer architecture and its genesis in attention mechanisms. We then review more extensively a number of transformer-based image captioning models, including those employing vision-language pre-training, which has resulted in several state-of-the-art models. We briefly present the commonly used datasets for image captioning and analyze and compare the transformer-based captioning models. We conclude with some insights into challenges and future directions for research in this area.

Keywords