Image Captioning With Positional and Geometrical Semantics

Anwar Ul Haque; Sayeed Ghani; Muhammad Saeed

doi:10.1109/ACCESS.2021.3131343

IEEE Access (Jan 2021)

Affiliations

Anwar Ul Haque: Institute of Business Administration, Karachi, Pakistan
Sayeed Ghani: Institute of Business Administration, Karachi, Pakistan
Muhammad Saeed: Department of Computer Science, University of Karachi, Karachi, Pakistan

DOI: https://doi.org/10.1109/ACCESS.2021.3131343
Journal volume & issue: Vol. 9, pp. 160917–160925

Abstract

The last five to six years have seen tremendous progress in automatic image captioning using deep learning. Initial research focused on attribute-to-attribute comparison of image features and text to describe an image as a sentence, whereas current research addresses issues related to semantics and correlations. However, current state-of-the-art research still lacks adequate concepts for positional and geometrical attributes. Most approaches rely on CNNs (Convolutional Neural Networks) for object feature extraction, which do not account for equivariance or rotational invariance; this leads to an orientation-less understanding of objects for captioning, along with longer training times and larger dataset requirements. Furthermore, CNN-based image captioning encoders fail to capture the geometrical alignment of object attributes within the image and hence mislabel distorted objects as correct. To address these issues, we propose ICPS (Image Captioning with Positional and Geometrical Semantics), a capsule network-based image captioning technique that uses transformer neural networks as the decoder. The proposed ICPS architecture handles various geometrical properties of image objects with the help of parallelized capsules, while the object-to-text decoding is done by transformer neural networks. The inclusion of cluster capsules provides better object understanding in terms of position, equivariance, and geometrical orientation, with more augmented object understanding from a small dataset in comparatively less time. The extracted image features provide a better understanding of image objects and help the decoding stage narrate effectively with positional and geometrical details. We trained and tested ICPS on the Flickr8k dataset and found that it captions better than other current state-of-the-art approaches when describing positional and geometrical transitions.
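The abstract does not specify how the ICPS capsules are formulated. As a generic illustration only, the sketch below shows the "squash" nonlinearity from the original capsule network formulation (Sabour, Frosst, and Hinton, 2017), which is one common way a capsule keeps orientation information in its output vector. It is not the authors' encoder; the function names and the example vectors are assumptions made for illustration.

```python
import math


def squash(vector, eps=1e-9):
    """Shrink a capsule's raw output to a length in [0, 1) without changing its direction.

    Generic capsule-network nonlinearity; assumed here for illustration,
    not taken from the ICPS paper.
    """
    sq_norm = sum(x * x for x in vector)
    norm = math.sqrt(sq_norm)
    # ||s||^2 / (1 + ||s||^2) sets the output length; dividing by ||s|| normalises direction.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in vector]


def length(vector):
    """Euclidean length, read as the capsule's confidence that an entity is present."""
    return math.sqrt(sum(x * x for x in vector))


raw = [3.0, 4.0]        # hypothetical capsule pre-activation
rotated = [-4.0, 3.0]   # the same vector rotated by 90 degrees

out = squash(raw)
out_rot = squash(rotated)

print(length(out), length(out_rot))  # equal lengths: presence is unaffected by rotation
print(out, out_rot)                  # different directions: orientation is preserved
```

In this toy case the output length is the same for both inputs while the output direction rotates with the input, which is the kind of equivariant behaviour the abstract contrasts with orientation-less CNN features.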

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
