Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks

Roberto Castro; Israel Pineda; Wansu Lim; Manuel Eugenio Morocho-Cayamcela

doi:10.1109/ACCESS.2022.3161428

IEEE Access (Jan 2022)

Affiliations

Roberto Castro: Deep Learning for Autonomous Driving, Robotics, and Computer Vision Research Group (DeepARC Research), School of Mathematical and Computational Sciences, Yachay Tech University, Urcuquí, Ecuador
Israel Pineda: Deep Learning for Autonomous Driving, Robotics, and Computer Vision Research Group (DeepARC Research), School of Mathematical and Computational Sciences, Yachay Tech University, Urcuquí, Ecuador
Wansu Lim: Department of Aeronautics, Mechanical and Electronic Convergence Engineering, Future Communications and Systems Laboratory (FCSL), Kumoh National Institute of Technology, Gumi-si, Republic of Korea
Manuel Eugenio Morocho-Cayamcela: Deep Learning for Autonomous Driving, Robotics, and Computer Vision Research Group (DeepARC Research), School of Mathematical and Computational Sciences, Yachay Tech University, Urcuquí, Ecuador

DOI: https://doi.org/10.1109/ACCESS.2022.3161428
Journal volume & issue: Vol. 10
pp. 33679 – 33694

Abstract

This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact that different hyperparameter configurations have on the efficiency of an encoder-decoder visual attention architecture. Results show that the choice of both the cost function and the gradient-based optimizer can significantly affect the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, and negative log-likelihood loss functions, together with the adaptive moment estimation (Adam), AdamW, RMSprop, stochastic gradient descent, and Adadelta optimizers. Experimentation shows that the combination of cross-entropy with Adam is the best alternative, returning a Top-5 accuracy of 73.092 and a BLEU-4 of 20.10. Furthermore, a comparative analysis of alternative convolutional architectures evaluated their performance as encoders. Our results show that ResNeXt-101 stands out with a Top-5 accuracy of 73.128 and a BLEU-4 of 19.80, positioning itself as the best option when looking for the optimum captioning quality. However, MobileNetV3 proved to be a much more compact alternative, with 2,971,952 parameters and 0.23 Giga fixed-point Multiply-Accumulate operations per Second (GMACS). Consequently, MobileNetV3 offers a competitive output quality at a lower computational cost, supported by values of 19.50 and 72.928 for the BLEU-4 and Top-5 accuracy, respectively. Finally, when testing vision transformer (ViT) and data-efficient image transformer (DeiT) models to replace the convolutional component of the architecture, DeiT achieved an improvement over ViT, obtaining a BLEU-4 of 34.44.
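
The abstract reports Top-5 accuracy without defining it. As a rough illustration only, the sketch below uses the common definition: the percentage of decoding steps at which the reference token is among the five highest-scoring vocabulary entries. Whether the paper computes the metric exactly this way is an assumption, and the function name `top_k_accuracy` and the toy numbers are hypothetical, not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int = 5
) -> float:
    """Percentage of steps whose target token is among the k highest scores.

    Assumed definition (not confirmed by the source): one score row per
    decoding step, one integer target index per row.
    """
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        return 0.0
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row (ties keep index order).
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return 100.0 * hits / len(targets)


# Toy example: a 6-token vocabulary and 2 decoding steps (made-up numbers).
toy_scores = [
    [0.1, 0.5, 0.2, 0.05, 0.1, 0.05],
    [0.3, 0.1, 0.1, 0.1, 0.1, 0.3],
]
toy_targets = [1, 2]

print(top_k_accuracy(toy_scores, toy_targets, k=1))
print(top_k_accuracy(toy_scores, toy_targets, k=5))
```

On this toy input the Top-1 score counts only the first step as correct, while Top-5 counts both, which is why Top-5 values such as those quoted in the abstract sit well above what an exact-match metric would give.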

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
