IEEE Access (Jan 2024)

Image and Video Captioning for Apparels Using Deep Learning

  • Govind Agarwal,
  • Kritika Jindal,
  • Abishi Chowdhury,
  • Vishal K. Singh,
  • Amrit Pal

DOI
https://doi.org/10.1109/ACCESS.2024.3443422
Journal volume & issue
Vol. 12
pp. 113138 – 113150

Abstract

In the rapidly evolving apparel industry, clear and engaging product descriptions are crucial for attracting customers. Given the importance of automated apparel descriptions, this work explores image captioning for apparel photos and extends it to video captioning, so that visually impaired people can access and understand dynamic apparel content. To address the issue of dataset diversity, we curated a collection of images divided into 26 classes. Using Convolutional Neural Network (CNN) architectures such as ConvNeXtLarge together with Long Short-Term Memory (LSTM) networks, the proposed system automatically generates accurate and engaging captions for both still images and videos that feature clothing. The LSTM network takes the visual features extracted by the CNN component from clothing photos and videos and produces captions that are both semantically and linguistically accurate. In addition, a YOLO model is included for real-time object detection, which allows the system to identify and track several articles of clothing at once. The proposed architecture is evaluated using the BLEU metric; experiments on the curated dataset yielded a BLEU-1 score of 0.983 for the ConvNeXtLarge-based model.
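
For readers unfamiliar with the metric reported above, the following minimal sketch shows how a sentence-level BLEU-1 score is commonly computed: clipped unigram precision multiplied by a brevity penalty. The abstract does not say which BLEU implementation, tokenization, or averaging scheme the authors used, so the function name `bleu1`, the whitespace tokenization, and the example captions are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Illustrative only; whitespace tokenization and lowercasing are assumptions.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each word, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    # Clip each candidate word count by its maximum reference count.
    clipped = sum(min(count, max_ref[word]) for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate length
    # (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


# Hypothetical captions, not taken from the paper's dataset.
reference = ["a red cotton shirt with short sleeves"]
print(bleu1("a red cotton shirt with long sleeves", reference))
```

A score near 1.0 means almost every word of the generated caption also appears in a reference caption. BLEU-1 ignores word order, which is why higher-order variants (BLEU-2 to BLEU-4) are often reported alongside it.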

Keywords