IEEE Access (Jan 2024)

Attention-Guided Hierarchical Parsing for Fine-Grained Person-Centric Image Captioning

  • Zhengcheng Gu
  • Jing Jin

DOI
https://doi.org/10.1109/ACCESS.2024.3416207
Journal volume & issue
Vol. 12
pp. 86293 – 86301

Abstract

Although current models have made significant progress in generating fine-grained captions for portrait images, they still struggle with attention allocation and with capturing the detailed characteristics of the subjects. As a result, they have difficulty generating accurate, refined captions for person images. To address this issue, we propose a model named Attention-guided Hierarchical Parsing (AHP). The model leverages the strong segmentation performance of the Segment Anything Model (SAM) to prioritize key information in person images, maintaining focus on the subject even in complex scenes. In addition, the model uses a multi-level image feature encoding-decoding framework that analyzes multi-scale features within an image, significantly enhancing its capacity to capture intricate image details. Extensive experimental results demonstrate the superior performance of the proposed model in generating fine-grained, high-quality captions, and the approach introduces new perspectives and methods to the field of fine-grained image caption generation.

Keywords