Controllable Image Captioning with Feature Refinement and Multilayer Fusion

Sen Du; Hong Zhu; Yujia Zhang; Dong Wang; Jing Shi; Nan Xing; Guangfeng Lin; Huiyu Zhou

doi:10.3390/app13085020

Applied Sciences (Apr 2023)

Affiliations

Sen Du: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Hong Zhu: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Yujia Zhang: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Dong Wang: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Jing Shi: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Nan Xing: School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
Guangfeng Lin: School of Printing, Packaging and Digital Media, Xi’an University of Technology, Xi’an 710054, China
Huiyu Zhou: School of Computing and Mathematical Sciences, University of Leicester, University Road, Leicester LE1 7RH, UK

DOI: https://doi.org/10.3390/app13085020
Journal volume & issue: Vol. 13, no. 8, p. 5020

Abstract

Image captioning is the task of automatically generating a description of an image. Traditional image captioning models tend to generate a sentence describing the most conspicuous objects, but fail to describe a desired region or object as a human would. Generating sentences based on a given target requires understanding the relationships between particular objects and describing them accurately. Specifically, information-augmented embedding is used to add prior information to each object, and a new Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) is designed to fuse the information of adjacent objects. Then, a dynamic attention decoder module selectively focuses on particular objects or semantic contents. Finally, the model is optimized with a similarity loss. Experiments on MSCOCO Entities show that the proposed model, IANR, obtains the best published CIDEr score to date, 124.52%, on the Karpathy test split. Extensive experiments and ablations on both MSCOCO Entities and Flickr30k Entities demonstrate the effectiveness of each module. IANR also achieves better accuracy and controllability than state-of-the-art models under widely used evaluation metrics.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
