IEEE Access (Jan 2024)
A Study of ConvNeXt Architectures for Enhanced Image Captioning
Abstract
This study explores the effectiveness of the ConvNeXt model, an advanced computer vision architecture, in the task of image captioning. We integrated ConvNeXt with a Long Short-Term Memory (LSTM) network that includes a visual attention module, and assessed its performance across different scenarios. Experiments varied the ConvNeXt version used for feature extraction, the learning rate during training, and whether teacher forcing was applied. The MS COCO 2014 dataset was employed, with top-5 accuracy and BLEU metrics used to evaluate performance. Integrating ConvNeXt into image captioning systems yielded notable performance gains. In terms of BLEU-4 scores, ConvNeXt outperformed existing benchmarks by 43.04% for models using soft attention and by 39.04% for those with hard attention. Furthermore, ConvNeXt surpassed models based on vision transformers and data-efficient image transformers by 4.57% and 0.93%, respectively, in BLEU-4 scores. Compared with systems using encoders such as ResNet-101, ResNet-152, VGG-16, ResNeXt-101, and MobileNet V3, ConvNeXt improved top-5 accuracy by 6.44%, 6.46%, 6.47%, 6.39%, and 6.68%, and reduced loss by 18.46%, 18.44%, 18.46%, 18.24%, and 18.72%, respectively.
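To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of a ConvNeXt encoder feeding an LSTM decoder with soft visual attention and optional teacher forcing. The class names, hidden sizes, and the choice of torchvision's `convnext_tiny` backbone are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ConvNeXtEncoder(nn.Module):
    """Extract spatial feature maps from a pretrained ConvNeXt backbone."""
    def __init__(self):
        super().__init__()
        # Assumption: ConvNeXt-Tiny; the paper compares several versions.
        backbone = models.convnext_tiny(weights="IMAGENET1K_V1")
        self.features = backbone.features  # drop the classification head

    def forward(self, images):                  # (B, 3, 224, 224)
        fmap = self.features(images)            # (B, 768, 7, 7)
        return fmap.flatten(2).transpose(1, 2)  # (B, 49, 768) region vectors


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over image regions."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):           # feats: (B, R, F)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)         # attention weights (B, R, 1)
        context = (alpha * feats).sum(dim=1)    # weighted region sum (B, F)
        return context, alpha.squeeze(-1)


class AttentionDecoder(nn.Module):
    """LSTM decoder that attends to encoder regions at each time step."""
    def __init__(self, vocab_size, feat_dim=768, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, feats, captions, teacher_forcing=True):
        B, T = captions.shape
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        token = captions[:, 0]                  # <start> tokens
        logits = []
        for t in range(1, T):
            context, _ = self.attention(feats, h)
            h, c = self.lstm(
                torch.cat([self.embed(token), context], dim=1), (h, c))
            step_logits = self.fc(h)
            logits.append(step_logits)
            # Teacher forcing feeds the ground-truth token; without it,
            # the model's own greedy prediction is fed back in.
            token = captions[:, t] if teacher_forcing \
                else step_logits.argmax(dim=1)
        return torch.stack(logits, dim=1)       # (B, T-1, vocab_size)
```

Under these assumptions, training would compare the stacked logits against the shifted caption tokens with cross-entropy, with and without teacher forcing, which mirrors the ablation described in the abstract.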
Keywords