Journal of King Saud University: Computer and Information Sciences (Sep 2024)

A novel image captioning model with visual-semantic similarities and visual representations re-weighting

  • Alaa Thobhani,
  • Beiji Zou,
  • Xiaoyan Kui,
  • Asma A. Al-Shargabi,
  • Zaid Derea,
  • Amr Abdussalam,
  • Mohammed A. Asham

Journal volume & issue
Vol. 36, no. 7
p. 102127

Abstract

Image captioning, the task of generating descriptive sentences for images, has seen significant advances through the incorporation of semantic information. However, previous studies employed semantic attribute detectors that extract a predetermined set of attributes applied identically at every time step, so attributes irrelevant to the linguistic context are used during word generation. Furthermore, the integration of semantic attributes and visual representations in previous works is superficial and ineffective, neglecting the rich visual-semantic connections that affect the captioning model’s performance. To address these limitations, we introduce a novel framework that adapts attribute usage to contextual relevance and effectively exploits the similarities between visual features and semantic attributes. Our framework includes an Attribute Detection Component (ADC) that predicts relevant attributes from visual features and attribute embeddings. The Attribute Prediction and Visual Weighting (APVW) module then dynamically adjusts these attributes and generates weights that refine the visual context vector, enhancing semantic alignment. Our approach achieved average improvements of 3.30% in BLEU@1 and 5.24% in CIDEr on MS-COCO, and 6.55% in BLEU@1 and 25.72% in CIDEr on Flickr30K, during the CIDEr optimization phase.
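
The following is a minimal, illustrative sketch of how the two modules named in the abstract (ADC and APVW) might be structured, assuming a standard PyTorch encoder-decoder captioning setup. The layer sizes, module names, and the similarity/gating scheme are assumptions made for illustration only, not the authors' implementation.

    import torch
    import torch.nn as nn


    class AttributeDetectionComponent(nn.Module):
        """Predicts attribute relevance scores from visual features and
        attribute embeddings (here: dot-product similarity + sigmoid, an assumption)."""

        def __init__(self, visual_dim, embed_dim, num_attributes):
            super().__init__()
            self.visual_proj = nn.Linear(visual_dim, embed_dim)
            self.attr_embed = nn.Embedding(num_attributes, embed_dim)

        def forward(self, visual_feats):
            # visual_feats: (batch, num_regions, visual_dim)
            v = self.visual_proj(visual_feats).mean(dim=1)      # pooled visual query
            sims = v @ self.attr_embed.weight.t()               # (batch, num_attributes)
            return torch.sigmoid(sims)                          # attribute scores


    class APVW(nn.Module):
        """Re-weights attribute scores with the decoder's linguistic state and
        produces weights that refine the visual context vector."""

        def __init__(self, hidden_dim, num_attributes, visual_dim):
            super().__init__()
            self.attr_gate = nn.Linear(hidden_dim, num_attributes)
            self.visual_weight = nn.Linear(hidden_dim + num_attributes, visual_dim)

        def forward(self, hidden_state, attr_scores, visual_context):
            # hidden_state: (batch, hidden_dim); visual_context: (batch, visual_dim)
            gate = torch.sigmoid(self.attr_gate(hidden_state))  # context-dependent gate
            attrs = gate * attr_scores                          # adapt attributes per time step
            w = torch.sigmoid(self.visual_weight(torch.cat([hidden_state, attrs], dim=-1)))
            return w * visual_context, attrs                    # re-weighted visual context

    # Hypothetical usage with region features from a detector (dimensions assumed):
    adc = AttributeDetectionComponent(visual_dim=2048, embed_dim=512, num_attributes=1000)
    apvw = APVW(hidden_dim=512, num_attributes=1000, visual_dim=2048)
    feats = torch.randn(2, 36, 2048)
    scores = adc(feats)
    refined_ctx, attrs = apvw(torch.randn(2, 512), scores, feats.mean(dim=1))

In this sketch, the decoder would call APVW at every decoding step, so both the active attributes and the visual context change with the linguistic state rather than being fixed for the whole caption.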

Keywords