Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network

Shima Javanmardi; Ali Mohammad Latif; Mohammad Taghi Sadeghi; Mehrdad Jahanbanifard; Marcello Bonsangue; Fons J. Verbeek

doi:10.3390/s22218376

Sensors (Nov 2022)

Affiliations

Shima Javanmardi: Computer Engineering Department, Yazd University, Yazd P.O. Box 8915818411, Iran
Ali Mohammad Latif: Computer Engineering Department, Yazd University, Yazd P.O. Box 8915818411, Iran
Mohammad Taghi Sadeghi: Electrical Engineering Department, Yazd University, Yazd P.O. Box 89195741, Iran
Mehrdad Jahanbanifard: Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
Marcello Bonsangue: Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
Fons J. Verbeek: Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands

DOI: https://doi.org/10.3390/s22218376
Journal volume & issue: Vol. 22, no. 21, p. 8376

Abstract

In image captioning, the main challenge in describing an image is identifying all of its objects, accurately capturing the relationships between them, and producing varied captions. Over the past few years, many methods have been proposed, ranging from attribute-to-attribute comparison to approaches that handle semantics and their relationships. Despite these improvements, existing techniques represent positional and geometrical attributes inadequately. The reason is that most of these approaches depend on Convolutional Neural Networks (CNNs) for object detection. CNNs are notorious for failing to capture equivariance and rotational invariance in objects. Moreover, the pooling layers in CNNs discard valuable information. Inspired by recent successful approaches, this paper introduces a novel framework for extracting meaningful descriptions based on a parallelized capsule network, which describes images through a high-level understanding of their semantic content. The main contribution of this paper is a new method that not only overcomes the limitations of CNNs but also generates descriptions with a wide variety of words by using Wikipedia. In our framework, capsules focus on generating meaningful descriptions with more detailed spatial and geometrical attributes for a given set of images by considering the positions of the entities as well as their relationships. Qualitative experiments on the MS-COCO benchmark dataset show that our framework outperforms state-of-the-art image captioning models when describing the semantic content of images.

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
