A comprehensive construction of deep neural network‐based encoder–decoder framework for automatic image captioning systems

Md Mijanur Rahman; Ashik Uzzaman; Sadia Islam Sami; Fatema Khatun; Md Al‐Amin Bhuiyan

doi:10.1049/ipr2.13287

IET Image Processing (Dec 2024)

Affiliations

Md Mijanur Rahman: Department of Computer Science and Engineering Jatiya Kabi Kazi Nazrul Islam University, Trishal Mymensingh Bangladesh
Ashik Uzzaman: Department of Computer Science and Engineering Jatiya Kabi Kazi Nazrul Islam University, Trishal Mymensingh Bangladesh
Sadia Islam Sami: Department of Computer Science and Engineering Jatiya Kabi Kazi Nazrul Islam University, Trishal Mymensingh Bangladesh
Fatema Khatun: Department of Electrical and Electronic Engineering Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj Dhaka Bangladesh
Md Al‐Amin Bhuiyan: Department of Computer Engineering King Faisal University, Hofuf Al Ahsa Saudi Arabia

DOI: https://doi.org/10.1049/ipr2.13287
Journal volume & issue: Vol. 18, no. 14
pp. 4778 – 4798

Abstract

This study introduces a novel encoder–decoder framework based on deep neural networks for automatic image captioning, together with a thorough investigation of the field. The proposed model uses a convolutional neural network (CNN) as the encoder, which recognizes objects and retains spatial information, and a long short‐term memory (LSTM) network as the decoder, which predicts words and constructs sentences. The VGG‐19 model serves as the image feature extractor, while the LSTM network functions as a sequence processor that generates a fixed‐length output vector for the final predictions. Training and testing use a variety of images from open‐access datasets, such as Flickr8k, Flickr30k, and MS COCO. The implementation is in Python, with Keras and TensorFlow as backends. The experimental results, assessed with the bilingual evaluation understudy (BLEU) metric, demonstrate the effectiveness of the proposed method for automatic image captioning. By addressing spatial relationships in images and producing logical, contextually relevant captions, the paper advances image captioning technology. A discussion of the difficulties encountered during experimentation suggests directions for future research. By establishing a robust neural network architecture for automatic image captioning, the study opens opportunities for further advancement in the area.
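The abstract names the bilingual evaluation understudy (BLEU) metric but does not say how it is computed. As a rough illustration only, the sketch below implements the standard sentence-level definition in plain Python: clipped n-gram precision combined with a brevity penalty. It is not the authors' evaluation code. The function names (`ngrams`, `modified_precision`, `brevity_penalty`, `sentence_bleu`) are hypothetical, and the sketch assumes uniform n-gram weights and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1.0 - r / c)


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives zero BLEU
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


if __name__ == "__main__":
    generated = "a dog runs on the grass".split()
    human = ["a dog is running on the grass".split()]
    print(sentence_bleu(generated, human, max_n=2))
```

Captioning work on Flickr8k, Flickr30k, and MS COCO commonly reports corpus-level BLEU-1 through BLEU-4, which pools n-gram counts over all test captions instead of averaging per-sentence scores. The sentence-level form here is only meant to show the two ingredients of the metric.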

Published in IET Image Processing

ISSN: 1751-9659 (Print); 1751-9667 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Technology: Photography; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/17519667
