IEEE Access (Jan 2024)
Image and Video Captioning for Apparels Using Deep Learning
Abstract
In the rapidly evolving apparel industry, writing clear and engaging product descriptions is crucial for attracting customers. Motivated by the need for automated apparel descriptions, this work explores image captioning for apparel photographs and extends it to video captioning, enabling visually impaired users to access and understand dynamic apparel content. To address the issue of dataset diversity, we curated a collection of images divided into 26 classes. Using a Convolutional Neural Network (CNN) architecture, ConvNeXtLarge, together with a Long Short-Term Memory (LSTM) network, the proposed system automatically generates accurate and engaging captions for both still images and videos featuring clothing. The LSTM decoder integrates the visual features extracted by the CNN encoder from apparel images and videos to produce captions that are both semantically and linguistically accurate. In addition, a YOLO model is incorporated for real-time object detection, enabling the system to identify and track multiple articles of clothing simultaneously. The proposed architecture is evaluated using the BLEU score; experiments on the curated dataset yielded a BLEU-1 score of 0.983 for the ConvNeXtLarge-based model.
Keywords